Lab 5: Text Analysis & NLP - Word Co-occurrence Analysis

library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(Matrix)
library(checkdown)

Word Co-occurrence Analysis

This analysis uses a dataset I scraped previously: the messages on my own BlueSky account. It is an example of a small dataset that is prone to overfitting and to correlations that do not generalize. With only a couple of messages, words can appear related simply because they happened to share a line, not because of a genuine connection that would hold up with more messages.

Step 1: Load the Raw Text

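spacex_lines <- readLines("spacex_cooccurrence.txt", encoding = "UTF-8")
# Trim leading/trailing whitespace and store the result so later steps use it
spacex_lines <- gsub("^\\s+|\\s+$", "", spacex_lines, useBytes = TRUE)
spacex_lines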

##  [1] "MSBA Week 2 Reflection"                                                            
##  [2] ""                                                                                  
##  [3] "a. I learned that factor functions in R can help with interpretability of factors."
##  [4] "b. I would like to learn more about coding different chart types in R."            
##  [5] "c. I find interesting how easy R makes data analysis."                             
##  [6] "d. A concern is occasional errors in typed code."                                  
##  [7] ""                                                                                  
##  [8] "#analyticsninja26"                                                                 
##  [9] "MSBA 580 Week 1 Reflection"                                                        
## [10] ""                                                                                  
## [11] "a. Today I learned how valuable it is professionally to know both Python and R."   
## [12] "b. I want to learn more about what makes R so effective for statistical analysis." 
## [13] "c. I find interesting learning how programming languages work."                    
## [14] "d. No concerns currently."                                                         
## [15] ""                                                                                  
## [16] "#analyticsninja26"

spacex_text <- tibble(
  id = seq_along(spacex_lines),
  text = spacex_lines
)

spacex_text

## # A tibble: 16 × 2
##       id text                                                                   
##    <int> <chr>                                                                  
##  1     1 "MSBA Week 2 Reflection"                                               
##  2     2 ""                                                                     
##  3     3 "a. I learned that factor functions in R can help with interpretabilit…
##  4     4 "b. I would like to learn more about coding different chart types in R…
##  5     5 "c. I find interesting how easy R makes data analysis."                
##  6     6 "d. A concern is occasional errors in typed code."                     
##  7     7 ""                                                                     
##  8     8 "#analyticsninja26"                                                    
##  9     9 "MSBA 580 Week 1 Reflection"                                           
## 10    10 ""                                                                     
## 11    11 "a. Today I learned how valuable it is professionally to know both Pyt…
## 12    12 "b. I want to learn more about what makes R so effective for statistic…
## 13    13 "c. I find interesting learning how programming languages work."       
## 14    14 "d. No concerns currently."                                            
## 15    15 ""                                                                     
## 16    16 "#analyticsninja26"

Each line in the file is treated as its own “document” (id), similar to how each sentence was its own document in the fruit example.

Step 2: Tokenize and Clean

custom_stop_words <- bind_rows(
  stop_words,
  # unnest_tokens() lowercases every token, so stop words must be lowercase ("i", not "I")
  tibble(word = c("im", "ive", "dont", "didnt", "isnt", "stock", "lol", "say", "typed", "i", "in", "it", "is"), lexicon = "custom")
)

spacex_words <- spacex_text %>%
  mutate(text = str_replace_all(text, "[^[:alnum:][:space:]$]", " ")) %>%
  unnest_tokens(word, text, token = "words") %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%   # drop pure numbers
  anti_join(custom_stop_words, by = "word")

spacex_words %>% count(word, sort = TRUE) %>% head(15)

## # A tibble: 15 × 2
##    word                 n
##    <chr>            <int>
##  1 analysis             2
##  2 analyticsninja26     2
##  3 learn                2
##  4 learned              2
##  5 makes                2
##  6 msba                 2
##  7 reflection           2
##  8 week                 2
##  9 chart                1
## 10 code                 1
## 11 coding               1
## 12 concern              1
## 13 concerns             1
## 14 data                 1
## 15 easy                 1
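
Case matters when matching stop words, because the tokens are lowercased before the anti-join. The base R sketch below illustrates this on a hypothetical one-line input. `strsplit()` with `tolower()` is only a rough stand-in for `unnest_tokens()`, and the two tiny stop lists are assumptions made for the example, not the real `stop_words` table.

```r
# Hypothetical one-line example, not taken from the dataset
line <- "I learned that R makes data analysis easy"

# Lowercase and split on non-alphanumeric characters, roughly what
# unnest_tokens(token = "words") does for simple English text
tokens <- strsplit(tolower(line), "[^[:alnum:]]+")[[1]]

# Tiny stand-ins for a stop word list (assumed for illustration only)
stops_upper <- c("I", "that")
stops_lower <- c("i", "that")

kept_upper <- setdiff(tokens, stops_upper)  # "i" survives because "I" != "i"
kept_lower <- setdiff(tokens, stops_lower)  # "i" is removed

kept_upper
kept_lower
```

An uppercase entry in a custom stop list therefore has no effect unless the tokens keep their original case.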

Step 3: Count Word Pairs

spacex_pairs <- spacex_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

spacex_pairs %>% head(15)

## # A tibble: 15 × 3
##    item1            item2                n
##    <chr>            <chr>            <dbl>
##  1 msba             week                 2
##  2 msba             reflection           2
##  3 week             reflection           2
##  4 makes            analysis             2
##  5 learned          factor               1
##  6 learned          functions            1
##  7 factor           functions            1
##  8 learned          interpretability     1
##  9 factor           interpretability     1
## 10 functions        interpretability     1
## 11 learned          factors              1
## 12 factor           factors              1
## 13 functions        factors              1
## 14 interpretability factors              1
## 15 learn            coding               1
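
To make the counting rule concrete, here is a minimal base R sketch of the idea behind `pairwise_count()`: each unordered pair of distinct words is counted once per document that contains both. The `docs` list is hypothetical (loosely modeled on the posts above), and this is an illustration of the technique rather than widyr's actual implementation.

```r
# Hypothetical tokens per document (id -> words)
docs <- list(
  `1` = c("msba", "week", "reflection"),
  `2` = c("makes", "data", "analysis"),
  `3` = c("msba", "week", "reflection"),
  `4` = c("makes", "statistical", "analysis")
)

# For each document, list every unordered pair of distinct words once
pairs_per_doc <- lapply(docs, function(words) {
  w <- sort(unique(words))
  if (length(w) < 2) return(character(0))
  apply(combn(w, 2), 2, paste, collapse = " + ")
})

# Count how many documents contain each pair
pair_counts <- sort(table(unlist(pairs_per_doc)), decreasing = TRUE)
pair_counts
```

Because each pair is counted at most once per document, the count can never exceed the number of documents that contain both words.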

Step 4: Build the Co-occurrence Matrix

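We’ll restrict the matrix to the most frequent words so it stays readable. The code asks for the top 15, but because most words appear only once and slice_max() keeps ties, all 28 words that take part in a pair end up in the matrix.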
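
One detail of `slice_max()` matters for a dataset this small: by default it keeps every row tied at the cutoff. The sketch below uses hypothetical counts (not the real data) to compare the default with `with_ties = FALSE`, an existing dplyr argument that returns exactly the requested number of rows.

```r
library(dplyr)

# Hypothetical word counts: two frequent words, then a five-way tie at n = 1
counts <- data.frame(
  word = c("msba", "week", "chart", "code", "coding", "data", "easy"),
  n    = c(2L, 2L, 1L, 1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Default: rows tied at the cutoff are all kept, so more than 3 rows come back
with_ties <- counts %>% slice_max(n, n = 3)

# with_ties = FALSE returns exactly 3 rows, breaking ties by row order
no_ties <- counts %>% slice_max(n, n = 3, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```

Whether to break ties is a judgment call: cutting a tie keeps the matrix small, but which tied words survive is then arbitrary.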

top_words <- spacex_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  pull(word)

spacex_matrix <- spacex_pairs %>%
  filter(item1 %in% top_words, item2 %in% top_words) %>%
  bind_rows(spacex_pairs %>% rename(item1 = item2, item2 = item1) %>%
              filter(item1 %in% top_words, item2 %in% top_words)) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()

spacex_matrix

##                  week reflection analysis factor functions interpretability
## msba                2          2        0      0         0                0
## week                0          2        0      0         0                0
## makes               0          0        2      0         0                0
## learned             0          0        0      1         1                1
## factor              0          0        0      0         1                1
## functions           0          0        0      1         0                1
## interpretability    0          0        0      1         1                0
## learn               0          0        1      0         0                0
## coding              0          0        0      0         0                0
## chart               0          0        0      0         0                0
## easy                0          0        1      0         0                0
## data                0          0        1      0         0                0
## concern             0          0        0      0         0                0
## occasional          0          0        0      0         0                0
## errors              0          0        0      0         0                0
## valuable            0          0        0      0         0                0
## professionally      0          0        0      0         0                0
## analysis            0          0        0      0         0                0
## effective           0          0        1      0         0                0
## learning            0          0        0      0         0                0
## programming         0          0        0      0         0                0
## reflection          2          0        0      0         0                0
## factors             0          0        0      1         1                1
## types               0          0        0      0         0                0
## code                0          0        0      0         0                0
## python              0          0        0      0         0                0
## statistical         0          0        1      0         0                0
## languages           0          0        0      0         0                0
##                  factors coding chart types makes data occasional errors code
## msba                   0      0     0     0     0    0          0      0    0
## week                   0      0     0     0     0    0          0      0    0
## makes                  0      0     0     0     0    1          0      0    0
## learned                1      0     0     0     0    0          0      0    0
## factor                 1      0     0     0     0    0          0      0    0
## functions              1      0     0     0     0    0          0      0    0
## interpretability       1      0     0     0     0    0          0      0    0
## learn                  0      1     1     1     1    0          0      0    0
## coding                 0      0     1     1     0    0          0      0    0
## chart                  0      1     0     1     0    0          0      0    0
## easy                   0      0     0     0     1    1          0      0    0
## data                   0      0     0     0     1    0          0      0    0
## concern                0      0     0     0     0    0          1      1    1
## occasional             0      0     0     0     0    0          0      1    1
## errors                 0      0     0     0     0    0          1      0    1
## valuable               0      0     0     0     0    0          0      0    0
## professionally         0      0     0     0     0    0          0      0    0
## analysis               0      0     0     0     2    1          0      0    0
## effective              0      0     0     0     1    0          0      0    0
## learning               0      0     0     0     0    0          0      0    0
## programming            0      0     0     0     0    0          0      0    0
## reflection             0      0     0     0     0    0          0      0    0
## factors                0      0     0     0     0    0          0      0    0
## types                  0      1     1     0     0    0          0      0    0
## code                   0      0     0     0     0    0          1      1    0
## python                 0      0     0     0     0    0          0      0    0
## statistical            0      0     0     0     1    0          0      0    0
## languages              0      0     0     0     0    0          0      0    0
##                  valuable professionally python effective statistical
## msba                    0              0      0         0           0
## week                    0              0      0         0           0
## makes                   0              0      0         1           1
## learned                 1              1      1         0           0
## factor                  0              0      0         0           0
## functions               0              0      0         0           0
## interpretability        0              0      0         0           0
## learn                   0              0      0         1           1
## coding                  0              0      0         0           0
## chart                   0              0      0         0           0
## easy                    0              0      0         0           0
## data                    0              0      0         0           0
## concern                 0              0      0         0           0
## occasional              0              0      0         0           0
## errors                  0              0      0         0           0
## valuable                0              1      1         0           0
## professionally          1              0      1         0           0
## analysis                0              0      0         1           1
## effective               0              0      0         0           1
## learning                0              0      0         0           0
## programming             0              0      0         0           0
## reflection              0              0      0         0           0
## factors                 0              0      0         0           0
## types                   0              0      0         0           0
## code                    0              0      0         0           0
## python                  1              1      0         0           0
## statistical             0              0      0         1           0
## languages               0              0      0         0           0
##                  programming languages msba learned learn easy concern learning
## msba                       0         0    0       0     0    0       0        0
## week                       0         0    2       0     0    0       0        0
## makes                      0         0    0       0     1    1       0        0
## learned                    0         0    0       0     0    0       0        0
## factor                     0         0    0       1     0    0       0        0
## functions                  0         0    0       1     0    0       0        0
## interpretability           0         0    0       1     0    0       0        0
## learn                      0         0    0       0     0    0       0        0
## coding                     0         0    0       0     1    0       0        0
## chart                      0         0    0       0     1    0       0        0
## easy                       0         0    0       0     0    0       0        0
## data                       0         0    0       0     0    1       0        0
## concern                    0         0    0       0     0    0       0        0
## occasional                 0         0    0       0     0    0       1        0
## errors                     0         0    0       0     0    0       1        0
## valuable                   0         0    0       1     0    0       0        0
## professionally             0         0    0       1     0    0       0        0
## analysis                   0         0    0       0     1    1       0        0
## effective                  0         0    0       0     1    0       0        0
## learning                   1         1    0       0     0    0       0        0
## programming                0         1    0       0     0    0       0        1
## reflection                 0         0    2       0     0    0       0        0
## factors                    0         0    0       1     0    0       0        0
## types                      0         0    0       0     1    0       0        0
## code                       0         0    0       0     0    0       1        0
## python                     0         0    0       1     0    0       0        0
## statistical                0         0    0       0     1    0       0        0
## languages                  1         0    0       0     0    0       0        1
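
The printed matrix is hard to scan because its rows and columns appear in different orders. As an alternative view, the base R sketch below builds a small symmetric matrix with one shared word order for rows and columns. The `pairs` values are copied from the top of the Step 3 output. The approach is an illustrative assumption, not a replacement for `cast_sparse()`.

```r
# Pair counts in "upper = FALSE" form: each pair listed once
pairs <- data.frame(
  item1 = c("msba", "msba", "week", "makes"),
  item2 = c("week", "reflection", "reflection", "analysis"),
  n     = c(2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Use one sorted vocabulary for both rows and columns
words <- sort(unique(c(pairs$item1, pairs$item2)))
m <- matrix(0, nrow = length(words), ncol = length(words),
            dimnames = list(words, words))

# Fill both [item1, item2] and [item2, item1] so the matrix is symmetric
m[cbind(pairs$item1, pairs$item2)] <- pairs$n
m[cbind(pairs$item2, pairs$item1)] <- pairs$n

m
isSymmetric(m)
```

With a shared ordering, each pair count appears mirrored across the diagonal, which makes it easy to confirm that no pair was dropped.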

Step 5: Visualize the Strongest Co-occurrences

To keep the network readable, we’ll only keep pairs that occurred together more than once.

set.seed(580)

spacex_pairs %>%
  filter(n > 1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), color = "firebrick") +
  geom_node_point(size = 4, color = "steelblue") +
  geom_node_text(aes(label = name), repel = TRUE, size = 3.5) +
  theme_void() +
  labs(
    title = "Word Co-occurrence Network: My BlueSky Account",
    subtitle = "Edge thickness = number of sentences containing both words"
  )

Reflection Questions

Question 1 Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph (see the last code chunk) change after removing these additional stopwords? If so, please briefly explain how and why.

I manually added the words “say”, “typed”, “I”, “in”, “it”, and “is” because they say nothing specific about the content of the posts; they could refer to anything. The last four in particular are purely functional parts of a sentence rather than descriptors of the content. Depending on what one is analyzing, some of these words may still be worth keeping. For example, “typed” becomes relevant alongside “code” if one wants to compare AI-generated and manually written code.

The final network graph did not change. The function words I added were most likely already covered by the built-in stop_words list, and “typed” appears in only one line, so it could never be part of a pair with n > 1. The sample is also very small (only 2 posts), so there are too few co-occurrences for a handful of stopwords to have a significant effect on the graph.

Question 2 Change “n” in the function, “filter(n > 1)”, in step 5 to a different value. What is your final value of n? Will it give you a better result?

Changing the filter to n > 2 gave a worse result. The filter keeps word pairs that occur together in more than n documents. Because I had only two posts on my account, no pair can co-occur more than twice, so n > 2 left no co-occurrences at all. Setting n > 0, the minimum threshold, kept every pair and therefore produced many more nodes. Setting n > 1 produced only five nodes, representing the word pairs that occurred together in more than one post, which in my dataset means exactly two posts.
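
The effect of the threshold can be checked without drawing the graph. The base R sketch below takes the first six rows of the Step 3 output and counts the surviving edges and nodes for each cutoff. The helper name `summarise_threshold` is made up for this example.

```r
# Pair counts copied from the first rows of the Step 3 output
pairs <- data.frame(
  item1 = c("msba", "msba", "week", "makes", "learned", "learned"),
  item2 = c("week", "reflection", "reflection", "analysis", "factor", "functions"),
  n     = c(2, 2, 2, 2, 1, 1),
  stringsAsFactors = FALSE
)

# For a threshold k, count the edges with n > k and the distinct nodes they touch
summarise_threshold <- function(k) {
  kept <- pairs[pairs$n > k, ]
  c(threshold = k,
    edges = nrow(kept),
    nodes = length(unique(c(kept$item1, kept$item2))))
}

threshold_summary <- t(sapply(0:2, summarise_threshold))
threshold_summary
```

On this subset, n > 1 leaves four edges among five nodes, and n > 2 leaves nothing, matching what the network plot shows.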

Additional Discussion Questions

Question 3 On my BlueSky post here: https://bsky.app/profile/did:plc:jbmmqoxfdgpoavycm7fz4k3r/post/3mphyccr2vc2g, I had two co-occurrence network graphs. Which one do you like better? Why?

I like the second graph because it has far fewer nodes. That suggests the filter threshold n was set higher, so the graph shows only the tightest semantic associations. The higher n is, the more posts a pair of words must appear in together to be kept (Chiericato, 2015).

Question 4 Which word pairs surprised you? Did “buy” cluster more with optimism words (“dip”) or caution words (“wait,” “valuation”)?

I was not particularly surprised by any of the word pairs. That is understandable: with such a small dataset, the correlations are likely to be obvious and lack the noise that would arise in a larger one. However, if I had to pick a word pair that amused me, it would be “code” and “errors”, because errors and bugs in code are a frequent annoyance.

Question 5 Should we treat stock tickers, $SPCX and SpaceX, as the same token instead of separate ones?

Whether to merge the tokens depends on the analysis and research we are doing (Jurafsky, 2023; Vanderwende, 2013). If we only want to identify one meaning, the underlying company (SpaceX), then treating them as the same token makes sense. However, if we are doing sentiment analysis, for example comparing the conversations where each form comes up, the ticker and the company name may be used differently in different contexts. In that case, it makes more sense to treat them as separate tokens (Chen, 2014).
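
If the two forms are to be merged, one option is to normalize the text before tokenizing. The base R sketch below assumes three hypothetical comments. The ticker and company name come from the question, and the choice of “spacex” as the canonical token is arbitrary.

```r
# Hypothetical comments, invented for illustration
comments <- c("Buying $SPCX on the dip",
              "SpaceX valuation looks high",
              "spacex launch today")

# Map every spelling to one canonical token before tokenizing.
# "\\$" escapes the dollar sign; ignore.case covers SpaceX / spacex / $spcx.
merged <- gsub("\\$spcx|spacex", "spacex", comments, ignore.case = TRUE)

merged
```

To keep the ticker and the company name separate instead, skip this step and make sure the cleaning regex does not strip the `$` prefix.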

Question 6 What’s the risk of drawing conclusions from a co-occurrence network built on only a small sample of comments? What is a good sample size? How would you communicate that limitation in a business report?

The risk of drawing conclusions from a co-occurrence network built on a small sample is overfitting to that sample. The same analysis on a larger dataset could look very different, because other comments vary widely in length and in which words co-occur. In a small dataset, words may appear highly connected merely because they happened to co-occur in a small number of comments, rather than because they are genuinely connected (Manning, 2008).

In a business report, I would communicate this limitation by saying: the co‑occurrence network in this report is based on a limited sample of comments. As a result, the relationships between terms may reflect the specific language patterns of this small group rather than the broader customer base. These findings should be interpreted as exploratory rather than definitive, and additional data would be required to confirm whether the patterns generalize.
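
One simple way to show this fragility is a leave-one-out check: drop each document in turn and recount a pair. The base R sketch below uses four hypothetical tokenized documents. The helper `count_pair` is made up for this example and is not part of widyr.

```r
# Hypothetical tokenized documents
docs <- list(
  c("msba", "week", "reflection"),
  c("makes", "data", "analysis"),
  c("msba", "week", "reflection"),
  c("makes", "statistical", "analysis")
)

# Number of documents that contain both words
count_pair <- function(docs, a, b) {
  sum(vapply(docs, function(w) all(c(a, b) %in% w), logical(1)))
}

# Leave one document out at a time and recount the "makes" + "analysis" pair
loo_counts <- vapply(seq_along(docs),
                     function(i) count_pair(docs[-i], "makes", "analysis"),
                     integer(1))
loo_counts
```

Here removing a single document can halve the count of the pair, which is the kind of instability a business report should flag.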

Question 7 Is there a way to analyze comments or reviews on topics that interest you in relation to a significant event? If so, how would you identify the event? Why is that event significant, and how might it influence the comments or reviews you analyze?

Yes, comments and reviews can be analyzed in relation to a significant event by aligning the text data to the event’s timeline. To do this, we identify an event that is time‑bounded and relevant to the topic — for example, a major SpaceX launch, a valuation announcement, or a regulatory decision. Such events are significant because they change the context in which people comment, often shifting sentiment, vocabulary, and co‑occurrence patterns (Jurafsky, 2023). By comparing comments before and after the event, we can observe how public perception responds to real‑world developments.
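
A before/after comparison can be sketched in a few lines of base R. Everything in the example is hypothetical: the comments, their dates, and the event date are invented to show the mechanics of splitting text by an event.

```r
# Hypothetical comments with dates; the event date is also an assumption
comments <- data.frame(
  date = as.Date(c("2025-01-03", "2025-01-05", "2025-01-12", "2025-01-14")),
  text = c("waiting on the launch", "valuation seems high",
           "launch was a success", "buying after the launch success"),
  stringsAsFactors = FALSE
)
event_date <- as.Date("2025-01-10")

# Label each comment relative to the event
comments$period <- ifelse(comments$date < event_date, "before", "after")

# Count how many comments in each period mention a given word
mentions <- function(word) {
  hit <- grepl(word, comments$text, fixed = TRUE)
  table(factor(comments$period[hit], levels = c("before", "after")))
}

mentions("success")
mentions("launch")
```

The same split could feed two separate co-occurrence networks, one per period, to compare how the vocabulary around a term shifts.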

References

Chen, H., De, P., Hu, Y., & Hwang, B. (2014). Wisdom of Crowds: The Value of Stock Opinions Transmitted Through Social Media. Review of Financial Studies.

Chiericato, E. (2015). Co Citations and Co Occurrences in SEO: The Complete Guide. Web Marketing Academy. https://webmarketing.academy/en/co-citazioni-co-occorrenze-guida-definitiva/

Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft).

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Vanderwende, L., Daumé III, H., & Kirchhoff, K. (Eds.). (2013). Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. https://aclanthology.org/N13-1000/