Core Language Study Methodologies
Inspired by 758 Microsoft Perceptions Landscape
1 Introduction
1.1 Background
Microsoft initially approached us looking to investigate user perceptions of the Microsoft brand and its core product suite. What we initially thought would be a straightforward parts-of-speech analysis, comprising adjective, verb and noun top terms, bigrams and term context networks, turned into a deeper analysis of language.
The methodology developed for this project forms the basis of what we now hope to offer in future as a “Core Language Study”. You can look at the delivered deck for more context.
1.2 Objectives
This document aims to walk through the methodologies developed and highlight the catches I came across in the process. I will also try to point out areas that I think we could dig deeper into in future projects.
2 Dealing with Large Datasets
Before moving on to the methods, I want to spend a minute on how to efficiently process large-scale data; without this, the project would have been impossible. Avoiding small samples is important here because later we are going to filter to only the posts that use the subject of interest (product / brand …) as the nominal subject of the sentence (more on this later), and this can reduce the data we are working with quite significantly.
It may not be the case for all datasets, but the volume of data collected for Microsoft and its core products was in the hundreds of thousands of posts - not exactly what my laptop wants to deal with on a Monday morning. At first glance everything may seem to be working fine when you load the data into R, but as you start to create mutations of the original dataframes and run functions like limpiar_spam_grams on the data, things quickly begin to go sour.
Here are my top tips for dealing with this.
2.1 Use Arrow and DuckDB
The R arrow library allows us to load data stored in columnar format into R and to pass that data to and from DuckDB tables. There is a subset of dplyr functions compatible with arrow data and a larger subset compatible with DuckDB tables. This allows us to do a lot of the work we need to do without loading all of our data into RAM.
Arrow stores in-memory data in columnar format, which is optimised for both storage size and computation speed. Essentially, it reduces the burden on RAM and accelerates computations.
DuckDB is a database management system designed for efficient querying. The R arrow package provides a convenient function, to_duckdb(), which allows us to create a virtual table in DuckDB, a format that is compatible with a wider range of dplyr functions.
This is a good resource for understanding how arrow and duckdb work together to process large data.
Unfortunately, we can’t read in xlsx files using arrow, and assuming that is the original format of our data, we would first have to read in the xlsx files using readxl and then save them as either parquet or csv files, both of which can be read in using arrow. This isn’t ideal but, unless the dataset is in the millions, simply reading in the data and resaving it in a different format, while time consuming, likely won’t send your laptop into shutdown. If you want to be on the safe side and avoid loading that amount of data into RAM, I would suggest reading the xlsx files, and resaving them in an appropriate format, in small chunks.
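As a rough illustration of that chunked conversion, a minimal sketch could look like the following (the folder names and file naming are hypothetical and should be adapted to your project):
library(readxl)
library(arrow)

# hypothetical folders - adjust to your project structure
xlsx_files <- list.files("data/xlsx_files", pattern = "\\.xlsx$", full.names = TRUE)

for (file in xlsx_files) {
  # read one workbook at a time to keep RAM usage low
  chunk <- readxl::read_xlsx(file)
  out_path <- file.path("data/parquet_files",
                        paste0(tools::file_path_sans_ext(basename(file)), ".parquet"))
  arrow::write_parquet(chunk, out_path)
  rm(chunk); gc() # release memory before moving on to the next file
}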
A better solution would be for the analyst exporting the data to export it as a csv - if you read this in time, you should ask them to.
If the data is in csv format and all saved in the same folder, we can read it all in with arrow using:
library(arrow)
file <- "data/csv_file"
data <- open_dataset(file, format = "csv", newlines_in_values = T)
data %>% class()
[1] "FileSystemDataset" "Dataset" "ArrowObject"
[4] "R6"
We can use many dplyr functions as normal:
data %>% nrow()
[1] 200
data %>% colnames()
 [1] "universal_message_id" "created_time"
[3] "sender_screen_name" "title"
[5] "message" "message_type"
[7] "sentiment" "social_network"
[9] "permalink" "language"
[11] "sender_user_id" "sender_followers_count"
[13] "sender_listed_name" "retweets"
[15] "favorites" "conversation_id"
[17] "parent_universal_message_id" "message_clean"
[19] ".document%%num_cuts" "words_count"
If we want to collect a table output from our computation, we will need to call collect().
data %>% slice_sample(n = 3) %>% collect()
# A tibble: 3 × 20
universal_message_id created_time sender_screen_name title message
<chr> <dttm> <chr> <chr> <chr>
1 TWITTER_7_17931018032168… 2024-05-22 02:09:18 casamassacre NA "@drvo…
2 WEB_98_45c9f164bbbcc8fcc… 2023-11-12 19:28:00 Edsontje Code… "Code …
3 TWITTER_7_17204477552465… 2023-11-03 14:28:03 ntropy_dev NA "This …
# ℹ 15 more variables: message_type <chr>, sentiment <chr>,
# social_network <chr>, permalink <chr>, language <chr>,
# sender_user_id <chr>, sender_followers_count <int>,
# sender_listed_name <chr>, retweets <int>, favorites <int>,
# conversation_id <chr>, parent_universal_message_id <chr>,
# message_clean <chr>, `.document%%num_cuts` <int>, words_count <int>
A full list of compatible dplyr functions can be found here.
If you want to use even more dplyr functions that aren’t available in that list, including some stringr functions, you can use the arrow function to_duckdb() to create a DuckDB table.
set.seed(12)
data %>%
to_duckdb() %>%
filter(str_detect(message, "copilot")) %>%
slice_sample(n = 5) %>%
pull(message_clean)
[1] "the decision to adopt github copilot was actually spearheaded by our ceo who challenged every department to explore how ai could enhance our workflows as an existing github customer the integration and onboarding process was seamless githubs sales team proactively reached out to us offering dedicated training sessions to get our engineers up to speed on github copilots capabilities this lowfriction approach was key we were able to trial the tool with a small group of developers before committing to a full rollout what we found was that github copilot excels in certain areas of our development process for greenfield projects or ideation around new features github copilot is invaluable in quickly prototyping code and providing helpful boilerplate its ability to draft coherent documentation has also been a revelation taking a tedious task off our engineers plates and when exploring how to implement something in a new language or framework github copilots contextual code suggestions save us time versus scouring stack overflow and scouring the internet of course github copilot also has its limitations for deeply complex existing codebases or intricate debugging tasks we find its less reliable the tool cant fully replace the nuanced judgment and problemsolving skills of our seasoned developers you really need to review the output from github copilot in this case actually be extra cautious and skeptical of it and then make necessary adjustments and corrections having said that as a forcemultiplier on mundane coding activities it has definitely unlocked new levels of productivity and focus for our team these kinds of productivity gains however are difficult to quantify so it is a subjective judgment and i would look at it as the new possibilities and capabilities gained that were previously not available to the team generative ai has come a long way and there is a lot of transformative potential of these technologies when applied to software development our engineers have enthusiastically embraced it and im excited to see how we can continue leveraging github copilot to work smarter in summary the decision to adopt github copilot was a nobrainer with our existing github relationship the low barrier to entry and the clear productivity gains it was a natural fit for our organization"
[2] "the vision is that copilot becomes the go to ai for every windows user because you only have to hit one button to open it up"
[3] "this is whats difficult for me to gauge parts of ai for sure have value github copilot i think does anecdotally ive heard that google and meta i think do but were waiting on the killer app nobody knows what it is yet my concern is not the long term usefulness of the"
[4] "you dogpiled on robtop for using ai for what its actually meant to copilots not even that big of a thing and now youre telling this girl what she should feel and what she shouldnt feel bro what"
[5] "when neuralink glitches the ai copilot takes control welcome to the future of autopilot"
If saving the duckdb output as a variable in memory, it is good practice to use to_arrow() to convert the duckdb output back to an arrow object after the computation has been completed.
While still orders of magnitude smaller than loading the data in as a csv, storing the data as a duckdb object does occupy more memory than an arrow object.
object.size(data)
504 bytes
data_duckdb <- data %>% to_duckdb()
object.size(data_duckdb)
40824 bytes
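As an illustration of that round trip, reusing the copilot filter from above (and assuming dplyr and stringr are already loaded):
copilot_posts <- data %>%
  to_duckdb() %>%                              # access the wider set of dplyr / stringr verbs
  filter(str_detect(message, "copilot")) %>%
  to_arrow()                                   # convert back so the result stays out of RAM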
2.2 Use Parallel Processing when Appropriate
The main areas where I used parallel processing here (where the option was not already built in as a function argument) were spam and stop word removal. Stop word removal is not dependent on the contents of any other post, so this is straightforward. Our current spam removal function, limpiar_spam_grams, is based on word-gram frequency across posts, so we need to be careful about how our parameters interact with how we split our data across cores.
Removing spam is a particularly slow process, and I found the fastest way to do it was to only load small sections of my data into memory at a time and process that small section in parallel before loading in and processing the next section. I suspect this is dependent on data size and is not necessary for smaller datasets.
First, plan the parallel sessions:
used_cores <- as.numeric(future::availableCores() - 1)
future::plan(future::multisession, workers = used_cores)
Set the number of groups you want to pull from the data and add a rowid column to the data so we can track this. I am numbering the data according to created time so that any spam posts that might be released around a similar event are loaded in together.
n_retrievals <- 4 # number of groups to pull from database
split_length <- ceiling((data %>% nrow())/n_retrievals)
data <- data %>%
arrange(created_time) %>%
to_duckdb() %>%
mutate(rowid = row_number()) %>%
to_arrow()
We can now loop through the data, performing spam grams on smaller sections of the dataset.
spam_grams <- list()
spam <- list()
for (i in 1:n_retrievals){
print(i) # track progress
# load in section of data
df_section <- data %>%
filter(rowid > (i-1)*split_length, rowid <= i*split_length)
# split loaded data according to planned cores
df_group <- df_section %>%
collect() %>%
mutate(split_group = rep(1:used_cores,
each = ceiling(nrow(.)/used_cores),
length.out = nrow(.))) %>%
dplyr::group_split(split_group)
# perform spam_grams
spam_grams <- df_group %>%
furrr::future_map(~LimpiaR::limpiar_spam_grams(
.x,
message_clean,
n_gram = 6,
min_freq = 5
),
.options = furrr::furrr_options(seed = T)
)
spam <- c(spam, spam_grams)
}
I did not find it necessary to only load small sections of the data in when removing stop words; this could be performed as normal, by splitting the full dataset according to available cores and using future_map to remove stop words in each section (see the sketch below).
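For completeness, a sketch of that approach might look like the following. I am using tm::removeWords purely for illustration here - swap in whichever stop word removal function you normally use - and the sketch assumes the future plan and used_cores from above.
stop_removed <- data %>%
  collect() %>%
  mutate(split_group = rep(1:used_cores,
                           each = ceiling(nrow(.)/used_cores),
                           length.out = nrow(.))) %>%
  dplyr::group_split(split_group) %>%
  furrr::future_map_dfr(
    ~ dplyr::mutate(.x, message_clean = tm::removeWords(message_clean, tm::stopwords("en")))
  )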
3 Data Processing
3.1 Data Cleaning
Data cleaning can be handled very close to standard:
For very large datasets, spam removal can be done according to the parallel processing section.
We only need to clean the data for pos tagging for now; consider removing hashtags and emojis but not punctuation or any formatting that might help the model tag words (see the sketch after this list).
Unless you want it for particular cleaning steps, stop word removal and any other text formatting you wish to perform can be done later; this is because we are splitting the posts into sentences for the pos tagging anyway.
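As a rough sketch of what that light-touch cleaning might look like (the data_pos name and the regexes are illustrative rather than the exact cleaning used in the project):
library(dplyr)
library(stringr)

data_pos <- data %>%
  collect() %>%
  mutate(clean_text_pos = message %>%
           str_remove_all("#\\S+") %>%   # strip hashtags
           str_remove_all("\\p{So}") %>% # strip emojis and other pictographic symbols
           str_squish())                 # tidy whitespace, keep punctuation and casing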
4 Parts of speech tagging
After briefly looking at some alternative models on huggingface, I decided to go with the tried and tested udpipe model available via udpipe or our own LimpiaR package. It is entirely up to you which package you choose to use but do note that some of the column names have been changed in our LimpiaR functions from the udpipe source.
I am going to do a quick run-through using the LimpiaR package.
First we perform pos tagging and dependency parsing on the clean_text_pos column (the text column retaining all punctuation, capital letters, etc.).
Be sure to set dependency_parse = TRUE
library(LimpiaR)
model <- limpiar_pos_import_model(language = "english")
annotations <- limpiar_pos_annotate(data = collect(data),
text_var = clean_text_pos,
id_var = "universal_message_id",
pos_model = model,
in_parallel = TRUE,
dependency_parse = TRUE) %>%
to_arrow()
[1] 49067
This gives us:
annotations %>% colnames()
 [1] "doc_id" "paragraph_id" "sentence_id"
[4] "sentence" "token_id" "token"
[7] "lemma" "pos_tag" "xpos"
[10] "feats" "head_token_id" "dependency_tag"
[13] "universal_message_id"
annotations %>%
head(20) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
We then want to join the dataset with itself based on where token_id is equal to head_token_id. head_token_id indicates the id of the word in the sentence that each token relates to, so by joining like this we can easily see how words relate to each other. While we can use arrow objects as an input to the merge function, the output is a standard data.frame; to keep the data size small, convert the output back to an arrow object.
Be sure to add “_parent”, or a similar identifier, to the column names associated with the parent tokens (columns sourced from the y part of the merge).
annotations_merge <- merge(
x = annotations,
y = annotations,
by.x = c("universal_message_id",
"paragraph_id",
"sentence_id",
"head_token_id"),
by.y = c("universal_message_id",
"paragraph_id",
"sentence_id",
"token_id"),
all.x = TRUE,
all.y = FALSE,
suffixes = c("", "_parent"),
sort = FALSE) %>%
as_arrow_table()
That leaves us with:
annotations_merge %>% colnames()
 [1] "universal_message_id" "paragraph_id" "sentence_id"
[4] "head_token_id" "doc_id" "sentence"
[7] "token_id" "token" "lemma"
[10] "pos_tag" "xpos" "feats"
[13] "dependency_tag" "doc_id_parent" "sentence_parent"
[16] "token_parent" "lemma_parent" "pos_tag_parent"
[19] "xpos_parent" "feats_parent" "head_token_id_parent"
[22] "dependency_tag_parent"
annotations_merge %>%
head(20) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
The merging function is slow, and sometimes merging the dataset with itself crashed my session. I think it’s preferable to merge the entire dataset, as it gives you more flexibility later on if your analysis leads to particular parts of speech you want to investigate. However, if the merge is too computationally heavy or you’re stuck for time, you can filter to only include tokens where the relevant product is the nominal subject before merging; this is what we are going to do to create the standard viz anyway. The dataset loaded here is a Copilot dataset, so:
annotations_filt <- annotations %>%
filter(token == "copilot",
dependency_tag == "nsubj")
annotations_merge <- merge(
x = annotations_filt,
y = annotations,
by.x = c("universal_message_id",
"paragraph_id",
"sentence_id",
"head_token_id"),
by.y = c("universal_message_id",
"paragraph_id",
"sentence_id",
"token_id"),
all.x = TRUE,
all.y = FALSE,
suffixes = c("", "_parent"),
sort = FALSE)
4.1 Pulling out Relevant Parts of Speech
For 758, we were looking at specific Microsoft products, and so for each dataset (copilot, azure, bing…) we pulled out only the word pairings where the relevant product was the subject of the sentence. For example, we currently have a sample of a copilot dataset loaded in, so we could pull out only the adjectives that act on the word copilot:
annotations_merge %>%
filter(tolower(token) == "copilot", # copilot is the token
dependency_tag %in% "nsubj", # copilot is the nominal subject
pos_tag_parent == "ADJ") %>% # parent token is adjective
mutate(term = paste(lemma_parent, token, sep = " ")) %>%
select(sentence, term) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
or the verbs that act on the word copilot:
annotations_merge %>%
filter(tolower(token) == "copilot", # copilot is the token
dependency_tag %in% "nsubj", # copilot is the nominal subject
pos_tag_parent == "VERB") %>% # parent token is adjective
mutate(term = paste(lemma_parent, token, sep = " ")) %>%
select(sentence, term) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
or the nouns (note how I am also filtering for proper nouns):
annotations_merge %>%
filter(tolower(token) == "copilot", # copilot is the token
dependency_tag %in% "nsubj", # copilot is the nominal subject
pos_tag_parent %in% c("NOUN", "PROPN")) %>% # parent token is adjective
mutate(term = paste(lemma_parent, token, sep = " ")) %>%
select(sentence, term) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
These are only a small selection of the parts-of-speech tags available. A full list of tags and explanations can be found here, and it could be worth exploring other relationships in future analyses.
annotations_merge %>% distinct(pos_tag) %>% collect()
# A tibble: 17 × 1
pos_tag
<chr>
1 PUNCT
2 INTJ
3 ADV
4 VERB
5 AUX
6 DET
7 PART
8 PRON
9 ADJ
10 PROPN
11 CCONJ
12 SCONJ
13 ADP
14 NOUN
15 SYM
16 X
17 NUM
4.2 Making the Viz
4.2.1 POS Top Terms
The first viz we made was a top terms chart for adjectives, verbs and nouns.
First we summarise the top terms, grouped by parent token and parent pos tag.
top_terms <- annotations_merge %>%
filter(tolower(token) == "copilot",
dependency_tag == "nsubj",
pos_tag_parent %in% c("NOUN", "PROPN", "ADJ", "VERB")) %>%
mutate(pos_tag_parent = case_when(
pos_tag_parent == "PROPN" ~ "NOUN",
TRUE ~ pos_tag_parent
),
lemma_parent = tolower(lemma_parent)) %>%
group_by(pos_tag_parent, lemma_parent) %>%
summarise(word_freq = n())
top_terms %>%
arrange(desc(word_freq)) %>%
collect()
# A tibble: 125 × 3
# Groups: pos_tag_parent [3]
pos_tag_parent lemma_parent word_freq
<chr> <chr> <int>
1 VERB do 10
2 VERB have 9
3 VERB help 8
4 VERB be 5
5 NOUN tool 5
6 ADJ better 4
7 VERB get 4
8 VERB generate 4
9 VERB give 4
10 NOUN ai 4
# ℹ 115 more rows
Then we can make the chart.
You can find the function brand_top_terms_nsub in the 9.functions.R file in the 758 directory.
brand_top_terms_nsub(
df = collect(top_terms),
top_n = 15,
colour = "blue")4.3 Clustered top terms
It is difficult to appreciate the true picture of user perceptions by looking simply at top terms. Maybe the word “bad” is used far more frequently than any single positive word, but in aggregate, positively skewing words like “good”, “best” and “perfect” are used more frequently than negatively skewing words. This was the motivation behind attempting to cluster terms. While I do think this was a useful pursuit, I think there is more work and thought that could go into this particular step.
In 758 we only clustered adjectives that are related to the product / brand of interest, but there is no reason that this couldn’t be expanded on.
I looked at a few different embedding models for single word embeddings (Word2Vec, WordNet grouping (didn’t actually get this working)…), but decided I was getting the best results with the BAAI/bge-large-en-v1.5 model and used the BertopicR pipeline to cluster the adjectives.
First, embed and reduce:
library(BertopicR)
adjs <- annotations_merge %>%
filter(tolower(token) == "copilot",
dependency_tag == "nsubj",
pos_tag_parent== "ADJ") %>%
pull(lemma_parent, as_vector = T) %>%
udpipe::txt_freq() %>%
arrange(desc(freq))
embedder <- bt_make_embedder_st("BAAI/bge-large-en-v1.5")
embeddings <- bt_do_embedding(embedder,
adjs$key,
accelerator = "mps")
reducer <- bt_make_reducer_umap(metric = "cosine")
reduced_embeddings <- bt_do_reducing(reducer, embeddings)
UMAP(angular_rp_forest=True, low_memory=False, metric='cosine', min_dist=0.0, n_components=5, random_state=42, verbose=True)
Wed Jul 24 12:47:40 2024 Construct fuzzy simplicial set
Wed Jul 24 12:47:40 2024 Finding Nearest Neighbors
Wed Jul 24 12:47:41 2024 Finished Nearest Neighbor Search
Wed Jul 24 12:47:42 2024 Construct embedding
Wed Jul 24 12:47:43 2024 Finished embedding
I set a min_cluster_size based on the number of embedded adjectives to attempt to keep some consistency in cluster number. This is not something that has to be done and could be left to your discretion. I also switched between cluster_selection_method being “leaf” or “eom” on a case-by-case basis.
min_cluster_size <- round(nrow(adjs)/15)
clusterer <- bt_make_clusterer_hdbscan(
min_cluster_size = min_cluster_size,
min_samples = 3L,
cluster_selection_method = "eom")
clusters <- bt_do_clustering(clusterer, reduced_embeddings)
table(clusters)
clusters
-1 0 1 2
4 20 6 3
Note that I am using a sample dataset of 200 posts here; we didn’t present, or even obtain, clusters this small in the project.
You could at this point create a Bertopic model with the clusters and use the bt_representation_openai function (I would suggest using a custom prompt) to title your clusters. I did not find this particularly helpful.
It can be helpful to look at a UMAP of the clusters before deciding if you are happy with them.
reducer2d <- bt_make_reducer_umap(n_components = 2L)
reduced_embeddings2d <- bt_do_reducing(reducer2d, embeddings)
UMAP(low_memory=False, min_dist=0.0, random_state=42, verbose=True)
Wed Jul 24 12:47:43 2024 Construct fuzzy simplicial set
Wed Jul 24 12:47:43 2024 Finding Nearest Neighbors
Wed Jul 24 12:47:43 2024 Finished Nearest Neighbor Search
Wed Jul 24 12:47:43 2024 Construct embedding
Wed Jul 24 12:47:43 2024 Finished embedding
plot_df <- data.frame(
key = adjs$key,
cluster = as.factor(clusters),
V1 = reduced_embeddings2d[,1],
V2 = reduced_embeddings2d[,2])
# filter(cluster != -1)
plot_df %>%
plotly::plot_ly(x = ~V1, y = ~V2,
type = "scatter", mode = "markers",
color = ~cluster,
text = ~key,
hoverinfo = "text") If we are happy with the clustsers, we can then visualise the top terms in each cluster.
You can find the function top_adj_grouped in the 9.functions.R file in the 758 directory.
Note that the title argument is for you to enter your own cluster titles as you see fit.
top_adj_grouped_all(df = adjs,
clusters = clusters,
title = as.character(rep(1:n_distinct(clusters))),
top_n = 5, # we would realistically show more for a delivery
colour = "blue")min_freq
I didn’t build a min_freq threshold into the clustering step or the top terms function; however, we only presented a subset of the clusters and, largely speaking, presented the clusters with the highest number of adjectives.
It is difficult to say whether or not it is appropriate to build a min_freq threshold into the clustering step, as the purpose of that step is to capture words and semantics that might otherwise be missed due to low volume, but it is probably wise to have some sort of minimum number of total words per cluster (one possible guard-rail is sketched below).
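A sketch of such a guard-rail: tally the total adjective volume per cluster and drop any cluster under a chosen threshold before presenting (the threshold of 10 here is an arbitrary illustration).
cluster_sizes <- dplyr::tibble(key = adjs$key,
                               freq = adjs$freq,
                               cluster = as.factor(clusters)) %>%
  filter(cluster != "-1") %>%   # ignore HDBSCAN outliers
  group_by(cluster) %>%
  summarise(total_words = sum(freq))

keep_clusters <- cluster_sizes %>%
  filter(total_words >= 10) %>% # arbitrary minimum total word count
  pull(cluster)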
Anyway… something you could think about!
Note that I did not actually present each cluster, and chose the clusters to present based on the story we were telling and the number of adjectives in the cluster. The function top_adj_grouped in the 9.functions.R file allows you to manually select clusters to display. This is where this step needs some thought -
- Should we be presenting only a subset of the clusters?
- Should we be trying to pull clusters with standard semantic meanings from each group?
- Should we be reverting to measuring sentiment altogether (that said, only measuring sentiment as it appears in posts where the product of interest is the subject)?
- Is there a better way of doing this?
- It could be interesting to see if we could train a clustering model that would form 3 or 4 standard clusters so that comparison across groups could be standardised.
4.4 Term Sentiment Network
The final viz I made for each specific product was a term network grouped by sentiment. This is fairly straightforward and plucked straight from the ParseR package. We did toy with looking at just adjectives, or just nouns, or the conversation as a whole, and settled on the conversation as a whole, but this is something that could be played around with provided it is consistent across products.
Since we are going to make a term viz, it might be beneficial to use lemmatised words. Be aware that models do not always lemmatise correctly and you should keep an eye out for anything that looks like it’s not right.
As you can see below, the product “Bing” is often lemmatised to “be”, and this is a correction we should hard code if we are using lemmatised words.
annotations_merge %>%
filter(tolower(token) == "bing") %>%
group_by(lemma) %>%
summarise(n = n()) %>%
collect()
# A tibble: 3 × 2
lemma n
<chr> <int>
1 be 12
2 Bing 4
3 bing 1
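A sketch of that kind of hard-coded correction, applied here to annotations_merge (the same fix would apply to the annotations table):
annotations_merge <- annotations_merge %>%
  collect() %>%
  mutate(lemma = dplyr::if_else(tolower(token) == "bing" & lemma == "be",
                                "bing", lemma)) %>%
  as_arrow_table()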
Conveniently, our pos tagging has already lemmatised our posts. We can rejoin those lemmatised words into their original post.
annotations_lemma <- annotations_merge %>%
group_by(universal_message_id, sentence_id) %>%
mutate(sentence_id = as.numeric(sentence_id),
token_id = as.numeric(token_id)) %>%
arrange(universal_message_id, sentence_id, token_id) %>%
collect() %>%
summarise(message_lemma = paste(lemma, collapse = " ")) %>%
ungroup() %>%
as_arrow_table()
In order to group the data by sentiment, we need to join the tagged data back to the original data.
sentiment_data <- data %>%
select(universal_message_id, sentiment)
data_joined <- annotations_merge %>%
filter(tolower(token) == "copilot",
dependency_tag == "nsubj") %>% # filter to where Copilot is the subject
select(universal_message_id, sentence) %>%
distinct(universal_message_id, sentence) %>% # only one row per sentence
left_join(sentiment_data, by = "universal_message_id")
One thing to note is that we haven’t yet removed stop words from the data, and this is something we will need to do before creating a sentiment term network.
Data Size
Depending on the size of the dataset, this might be somewhere you could use furrr to remove stop words in parallel. See the parallel processing section for an example of how to do this.
data_joined <- data_joined %>%
collect() %>%
mutate(sentence_clean = tm::removeWords(sentence, tm::stopwords("en"))) %>%
as_arrow_table()
When making the chart there are some style options we can set; you can see the ParseR documentation for more detail.
selected_terms <- c("NEGATIVE", "POSITIVE", "NEUTRAL",
"search", "windows", "msft", "microsoft",
"google", "github", "chatgpt", "copilot",
"gpt")
sentiment_colours <- c("NEGATIVE" = "#8b0000",
"NEUTRAL" = "grey45",
"POSITIVE" = "#008b00")
set.seed(12)
data_joined %>%
collect() %>%
ParseR::viz_group_terms_network(group_var = sentiment,
text_var = sentence_clean,
n_terms = 20,
text_size = 4,
with_ties = TRUE,
group_colour_map = sentiment_colours,
terms_colour = "grey50",
selected_terms = selected_terms,
selected_terms_colour = "black")
I did look at creating this based on correlation but found that it was bringing less frequent, and sometimes spammy or irrelevant, terms to the forefront. For example, there is an Excel scooter which was obviously brought in within the Excel query and was not removed during my cleaning steps, so, while it was mentioned in only a small number of posts, there were various scooter-related terms coming out in the correlation chart.
That said, I would be open to having another look at this or at other statistical measures we could apply to a term network.