Core Language Study Methodologies
Inspired by 758 Microsoft Perceptions Landscape
1 Introduction
1.1 Background
Microsoft initially approached us looking to investigate user perceptions of the Microsoft brand and its core product suite. What we initially thought would be a straightforward parts-of-speech analysis, comprising adjective, verb and noun top terms, bigrams and term context networks, turned into a deeper analysis of language.
The methodology developed for this project forms the basis of what we now hope to offer in future as a “Core Language Study”. You can look at the delivered deck for more context.
1.2 Objectives
This document aims to walk through the methodologies developed and highlight the catches I came across in the process. I will also try to point out areas that I think we could dig deeper into in future projects.
2 Dealing with Large Datasets
Before moving on to the methods, I want to spend a minute on how to efficiently process large-scale data; without this, the project would have been impossible. Avoiding small samples is important here because later we are going to filter to only the posts that use the subject of interest (product / brand …) as the nominal subject of the sentence (more on this later), and this can reduce the data we are working with quite significantly.
It may not be the case for all datasets, but the volume of data collected for Microsoft and its core products was in the hundreds of thousands of posts - not exactly what my laptop wants to deal with on a Monday morning. At first glance everything may seem to be working fine when you load the data into R, but as you start to create mutations of the original dataframes and run functions like limpiar_spam_grams on the data, things quickly begin to go sour.
Here are my top tips for dealing with this.
2.1 Use Arrow and DuckDB
The R arrow library allows us to load data stored in columnar format into R and to pass that data to and from DuckDB tables. There is a subset of dplyr functions compatible with arrow data and a larger subset compatible with DuckDB tables. This allows us to do a lot of the work we need to do without loading all of our data into RAM.
Arrow stores in-memory data in columnar format, which is optimised for both storage size and computation speed. Essentially, it reduces the burden on RAM and accelerates computations.
DuckDB is a database management system designed for efficient querying. The R arrow package provides a convenient function, to_duckdb(), which allows us to create a virtual table in DuckDB, a format that is compatible with a wider range of dplyr functions.
This is a good resource for understanding how arrow and duckdb work together to process large data.
Unfortunately, we can’t read in xlsx files using arrow, and assuming that is the original format of our data, we would first have to read in the xlsx files using readxl and then save them as either parquet or csv files, both of which can be read in using arrow. This isn’t ideal but, unless the dataset is in the millions, simply reading in the data and resaving it in a different format, while time consuming, likely won’t send your laptop into shutdown. If you want to be on the safe side and avoid loading that amount of data into RAM, I would suggest reading the xlsx files, and resaving them in an appropriate format, in small chunks.
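As a rough illustration of that chunked conversion, a minimal sketch could look like the following (the folder names and file naming are hypothetical and should be adapted to your project):
library(readxl)
library(arrow)

# hypothetical folders - adjust to your project structure
xlsx_files <- list.files("data/xlsx_files", pattern = "\\.xlsx$", full.names = TRUE)

for (file in xlsx_files) {
  # read one workbook at a time to keep RAM usage low
  chunk <- readxl::read_xlsx(file)
  out_path <- file.path("data/parquet_files",
                        paste0(tools::file_path_sans_ext(basename(file)), ".parquet"))
  arrow::write_parquet(chunk, out_path)
  rm(chunk); gc() # release memory before moving on to the next file
}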
A better solution would be for the analyst exporting the data to export it as a csv - if you read this in time, you should ask them to.
If the data is in csv format and all saved in the same folder, we can read it all in with arrow using:
library(arrow)
file <- "data/csv_file"
data <- open_dataset(file, format = "csv", newlines_in_values = T)
data %>% class()
[1] "FileSystemDataset" "Dataset" "ArrowObject"
[4] "R6"
We can use many dplyr functions as normal:
data %>% nrow()
[1] 200
data %>% colnames()
 [1] "universal_message_id" "created_time"
[3] "sender_screen_name" "title"
[5] "message" "message_type"
[7] "sentiment" "social_network"
[9] "permalink" "language"
[11] "sender_user_id" "sender_followers_count"
[13] "sender_listed_name" "retweets"
[15] "favorites" "conversation_id"
[17] "parent_universal_message_id" "message_clean"
[19] ".document%%num_cuts" "words_count"
If we want to collect a table output from our computation, we will need to call collect().
data %>% slice_sample(n = 3) %>% collect()
# A tibble: 3 × 20
universal_message_id created_time sender_screen_name title message
<chr> <dttm> <chr> <chr> <chr>
1 TWITTER_7_17931018032168… 2024-05-22 02:09:18 casamassacre NA "@drvo…
2 WEB_98_45c9f164bbbcc8fcc… 2023-11-12 19:28:00 Edsontje Code… "Code …
3 TWITTER_7_17204477552465… 2023-11-03 14:28:03 ntropy_dev NA "This …
# ℹ 15 more variables: message_type <chr>, sentiment <chr>,
# social_network <chr>, permalink <chr>, language <chr>,
# sender_user_id <chr>, sender_followers_count <int>,
# sender_listed_name <chr>, retweets <int>, favorites <int>,
# conversation_id <chr>, parent_universal_message_id <chr>,
# message_clean <chr>, `.document%%num_cuts` <int>, words_count <int>
A full list of compatible dplyr functions can be found here.
If you want to use even more dplyr functions that aren’t available in that list, including some stringr functions, you can use the arrow function to_duckdb() to create a DuckDB table.
set.seed(12)
data %>%
to_duckdb() %>%
filter(str_detect(message, "copilot")) %>%
slice_sample(n = 5) %>%
pull(message_clean)
[1] "the decision to adopt github copilot was actually spearheaded by our ceo who challenged every department to explore how ai could enhance our workflows as an existing github customer the integration and onboarding process was seamless githubs sales team proactively reached out to us offering dedicated training sessions to get our engineers up to speed on github copilots capabilities this lowfriction approach was key we were able to trial the tool with a small group of developers before committing to a full rollout what we found was that github copilot excels in certain areas of our development process for greenfield projects or ideation around new features github copilot is invaluable in quickly prototyping code and providing helpful boilerplate its ability to draft coherent documentation has also been a revelation taking a tedious task off our engineers plates and when exploring how to implement something in a new language or framework github copilots contextual code suggestions save us time versus scouring stack overflow and scouring the internet of course github copilot also has its limitations for deeply complex existing codebases or intricate debugging tasks we find its less reliable the tool cant fully replace the nuanced judgment and problemsolving skills of our seasoned developers you really need to review the output from github copilot in this case actually be extra cautious and skeptical of it and then make necessary adjustments and corrections having said that as a forcemultiplier on mundane coding activities it has definitely unlocked new levels of productivity and focus for our team these kinds of productivity gains however are difficult to quantify so it is a subjective judgment and i would look at it as the new possibilities and capabilities gained that were previously not available to the team generative ai has come a long way and there is a lot of transformative potential of these technologies when applied to software development our engineers have enthusiastically embraced it and im excited to see how we can continue leveraging github copilot to work smarter in summary the decision to adopt github copilot was a nobrainer with our existing github relationship the low barrier to entry and the clear productivity gains it was a natural fit for our organization"
[2] "the vision is that copilot becomes the go to ai for every windows user because you only have to hit one button to open it up"
[3] "this is whats difficult for me to gauge parts of ai for sure have value github copilot i think does anecdotally ive heard that google and meta i think do but were waiting on the killer app nobody knows what it is yet my concern is not the long term usefulness of the"
[4] "you dogpiled on robtop for using ai for what its actually meant to copilots not even that big of a thing and now youre telling this girl what she should feel and what she shouldnt feel bro what"
[5] "when neuralink glitches the ai copilot takes control welcome to the future of autopilot"
If saving the duckdb output as a variable in memory, it is good practice to use to_arrow() to convert the duckdb output back to an arrow object after the computation has been completed.
While still orders of magnitude smaller than loading the data in as a csv, storing the data as a duckdb object does occupy more memory than an arrow object.
object.size(data)
504 bytes
data_duckdb <- data %>% to_duckdb()
object.size(data_duckdb)
40824 bytes
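As an illustration of that round trip, reusing the copilot filter from above (and assuming dplyr and stringr are already loaded):
copilot_posts <- data %>%
  to_duckdb() %>%                              # access the wider set of dplyr / stringr verbs
  filter(str_detect(message, "copilot")) %>%
  to_arrow()                                   # convert back so the result stays out of RAM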
2.2 Use Parallel Processing when Appropriate
The main areas where I used parallel processing here (where the option was not already built in as a function argument) were spam and stop word removal. Stop word removal is not dependent on the contents of any other post, so this is straightforward. Our current spam removal function, limpiar_spam_grams, is based on word-gram frequency across posts, so we need to be careful about how our parameters interact with how we split our data across cores.
Removing spam is a particularly slow process, and I found the fastest way to do it was to only load small sections of my data into memory at a time and process that small section in parallel before loading in and processing the next section. I suspect this is dependent on data size and is not necessary for smaller datasets.
First, plan the parallel sessions:
used_cores <- as.numeric(future::availableCores() - 1)
future::plan(future::multisession, workers = used_cores)
Set the number of groups you want to pull from the data and add a rowid column to the data so we can track this. I am numbering the data according to created time so that any spam posts that might be released around a similar event are loaded in together.
n_retrievals <- 4 # number of groups to pull from database
split_length <- ceiling((data %>% nrow())/n_retrievals)
data <- data %>%
arrange(created_time) %>%
to_duckdb() %>%
mutate(rowid = row_number()) %>%
to_arrow()
We can now loop through the data, performing spam grams on smaller sections of the dataset.
spam_grams <- list()
spam <- list()
for (i in 1:n_retrievals){
print(i) # track progress
# load in section of data
df_section <- data %>%
filter(rowid > (i-1)*split_length, rowid <= i*split_length)
# split loaded data according to planned cores
df_group <- df_section %>%
collect() %>%
mutate(split_group = rep(1:used_cores,
each = ceiling(nrow(.)/used_cores),
length.out = nrow(.))) %>%
dplyr::group_split(split_group)
# perform spam_grams
spam_grams <- df_group %>%
furrr::future_map(~LimpiaR::limpiar_spam_grams(
.x,
message_clean,
n_gram = 6,
min_freq = 5
),
.options = furrr::furrr_options(seed = T)
)
spam <- c(spam, spam_grams)
}
I did not find it necessary to only load small sections of the data in when removing stop words; this could be performed as normal, by splitting the full dataset according to available cores and using future_map to remove stop words in each section (see the sketch below).
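For completeness, a sketch of that approach might look like the following. I am using tm::removeWords purely for illustration here - swap in whichever stop word removal function you normally use - and the sketch assumes the future plan and used_cores from above.
stop_removed <- data %>%
  collect() %>%
  mutate(split_group = rep(1:used_cores,
                           each = ceiling(nrow(.)/used_cores),
                           length.out = nrow(.))) %>%
  dplyr::group_split(split_group) %>%
  furrr::future_map_dfr(
    ~ dplyr::mutate(.x, message_clean = tm::removeWords(message_clean, tm::stopwords("en")))
  )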
3 Data Processing
3.1 Data Cleaning
Data cleaning can be handled very close to standard:
For very large datasets, spam removal can be done according to the parallel processing section.
We only need to clean the data for pos tagging for now; consider removing hashtags and emojis but not punctuation or any formatting that might help the model tag words (see the sketch after this list).
Unless you want it for particular cleaning steps, stop word removal and any other text formatting you wish to perform can be done later; this is because we are splitting the posts into sentences for the pos tagging anyway.
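As a rough sketch of what that light-touch cleaning might look like (the data_pos name and the regexes are illustrative rather than the exact cleaning used in the project):
library(dplyr)
library(stringr)

data_pos <- data %>%
  collect() %>%
  mutate(clean_text_pos = message %>%
           str_remove_all("#\\S+") %>%   # strip hashtags
           str_remove_all("\\p{So}") %>% # strip emojis and other pictographic symbols
           str_squish())                 # tidy whitespace, keep punctuation and casing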
4 Parts of speech tagging
After briefly looking at some alternative models on huggingface, I decided to go with the tried and tested udpipe model available via udpipe or our own LimpiaR package. It is entirely up to you which package you choose to use but do note that some of the column names have been changed in our LimpiaR functions from the udpipe source.
I am going to do a quick run-through using the LimpiaR package.
First we perform pos tagging and dependency parsing on the clean_text_pos column (the text column retaining all punctuation, capital letters, etc.).
Be sure to set dependency_parse = TRUE
library(LimpiaR)
model <- limpiar_pos_import_model(language = "english")
annotations <- limpiar_pos_annotate(data = collect(data),
text_var = clean_text_pos,
id_var = "universal_message_id",
pos_model = model,
in_parallel = TRUE,
dependency_parse = TRUE) %>%
to_arrow()
[1] 49067
This gives us:
annotations %>% colnames()
 [1] "doc_id" "paragraph_id" "sentence_id"
[4] "sentence" "token_id" "token"
[7] "lemma" "pos_tag" "xpos"
[10] "feats" "head_token_id" "dependency_tag"
[13] "universal_message_id"
annotations %>%
head(20) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
We then want to join the dataset with itself based on where token_id is equal to head_token_id. head_token_id indicates the id of the word in the sentence that each token relates to, so by joining like this we can easily see how words relate to each other. While we can use arrow objects as an input to the merge function, the output is a standard data.frame; to keep the data size small, convert the output back to an arrow object.
Be sure to add “_parent”, or a similar identifier, to the column names associated with the parent tokens (columns sourced from the y part of the merge).
annotations_merge <- merge(
x = annotations,
y = annotations,
by.x = c("universal_message_id",
"paragraph_id",
"sentence_id",
"head_token_id"),
by.y = c("universal_message_id",
"paragraph_id",
"sentence_id",
"token_id"),
all.x = TRUE,
all.y = FALSE,
suffixes = c("", "_parent"),
sort = FALSE) %>%
as_arrow_table()
That leaves us with:
annotations_merge %>% colnames()
 [1] "universal_message_id" "paragraph_id" "sentence_id"
[4] "head_token_id" "doc_id" "sentence"
[7] "token_id" "token" "lemma"
[10] "pos_tag" "xpos" "feats"
[13] "dependency_tag" "doc_id_parent" "sentence_parent"
[16] "token_parent" "lemma_parent" "pos_tag_parent"
[19] "xpos_parent" "feats_parent" "head_token_id_parent"
[22] "dependency_tag_parent"
annotations_merge %>%
head(20) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
The merging function is slow, and sometimes merging the dataset with itself crashed my session. I think it’s preferable to merge the entire dataset, as it gives you more flexibility later on if your analysis leads to particular parts of speech you want to investigate. However, if the merge is too computationally heavy or you’re stuck for time, you can filter to only include tokens where the relevant product is the nominal subject before merging; this is what we are going to do to create the standard viz anyway. The dataset loaded here is a Copilot dataset, so:
annotations_filt <- annotations %>%
filter(token == "copilot",
dependency_tag == "nsubj")
annotations_merge <- merge(
x = annotations_filt,
y = annotations,
by.x = c("universal_message_id",
"paragraph_id",
"sentence_id",
"head_token_id"),
by.y = c("universal_message_id",
"paragraph_id",
"sentence_id",
"token_id"),
all.x = TRUE,
all.y = FALSE,
suffixes = c("", "_parent"),
sort = FALSE)
4.1 Pulling out Relevant Parts of Speech
For 758, we were looking at specific Microsoft products, and so for each dataset (copilot, azure, bing…) we pulled out only the word pairings where the relevant product was the subject of the sentence. For example, we currently have a sample of a copilot dataset loaded in, so we could pull out only the adjectives that act on the word copilot:
annotations_merge %>%
filter(tolower(token) == "copilot", # copilot is the token
dependency_tag %in% "nsubj", # copilot is the nominal subject
pos_tag_parent == "ADJ") %>% # parent token is adjective
mutate(term = paste(lemma_parent, token, sep = " ")) %>%
select(sentence, term) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
or the verbs that act on the word copilot:
annotations_merge %>%
filter(tolower(token) == "copilot", # copilot is the token
dependency_tag %in% "nsubj", # copilot is the nominal subject
pos_tag_parent == "VERB") %>% # parent token is adjective
mutate(term = paste(lemma_parent, token, sep = " ")) %>%
select(sentence, term) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
or the nouns (note how I am also filtering for proper nouns):
annotations_merge %>%
filter(tolower(token) == "copilot", # copilot is the token
dependency_tag %in% "nsubj", # copilot is the nominal subject
pos_tag_parent %in% c("NOUN", "PROPN")) %>% # parent token is adjective
mutate(term = paste(lemma_parent, token, sep = " ")) %>%
select(sentence, term) %>%
collect() %>%
DT::datatable(options = list(scrollX = TRUE, pageLength = 3))
These are only a small selection of the parts-of-speech tags available. A full list of tags and explanations can be found here, and it could be worth exploring other relationships in future analyses.
annotations_merge %>% distinct(pos_tag) %>% collect()
# A tibble: 17 × 1
pos_tag
<chr>
1 PUNCT
2 INTJ
3 ADV
4 VERB
5 AUX
6 DET
7 PART
8 PRON
9 ADJ
10 PROPN
11 CCONJ
12 SCONJ
13 ADP
14 NOUN
15 SYM
16 X
17 NUM
4.2 Making the Viz
4.2.1 POS Top Terms
The first viz we made was a top terms chart for adjectives, verbs and nouns.
First we summarise the top terms, grouped by parent token and parent pos tag.
top_terms <- annotations_merge %>%
filter(tolower(token) == "copilot",
dependency_tag == "nsubj",
pos_tag_parent %in% c("NOUN", "PROPN", "ADJ", "VERB")) %>%
mutate(pos_tag_parent = case_when(
pos_tag_parent == "PROPN" ~ "NOUN",
TRUE ~ pos_tag_parent
),
lemma_parent = tolower(lemma_parent)) %>%
group_by(pos_tag_parent, lemma_parent) %>%
summarise(word_freq = n())
top_terms %>%
arrange(desc(word_freq)) %>%
collect()
# A tibble: 125 × 3
# Groups: pos_tag_parent [3]
pos_tag_parent lemma_parent word_freq
<chr> <chr> <int>
1 VERB do 10
2 VERB have 9
3 VERB help 8
4 VERB be 5
5 NOUN tool 5
6 ADJ better 4
7 VERB get 4
8 VERB generate 4
9 VERB give 4
10 NOUN ai 4
# ℹ 115 more rows
Then we can make the chart.
You can find the function brand_top_terms_nsub in the 9.functions.R file in the 758 directory.
brand_top_terms_nsub(
df = collect(top_terms),
top_n = 15,
colour = "blue")4.3 Clustered top terms
It is difficult to appreciate the true picture of user perceptions by looking simply at top terms. Maybe the word “bad” is used far more frequently than any single positive word, but in aggregate, positively skewing words like “good”, “best” and “perfect” are used more frequently than negatively skewing words. This was the motivation behind attempting to cluster terms. While I do think this was a useful pursuit, I think there is more work and thought that could go into this particular step.
In 758 we only clustered adjectives that are related to the product / brand of interest, but there is no reason that this couldn’t be expanded on.
I looked at a few different embedding models for single word embeddings (Word2Vec, WordNet grouping (didn’t actually get this working)…), but decided I was getting the best results with the BAAI/bge-large-en-v1.5 model and used the BertopicR pipeline to cluster the adjectives.
First, embed and reduce:
library(BertopicR)
adjs <- annotations_merge %>%
filter(tolower(token) == "copilot",
dependency_tag == "nsubj",
pos_tag_parent== "ADJ") %>%
pull(lemma_parent, as_vector = T) %>%
udpipe::txt_freq() %>%
arrange(desc(freq))
embedder <- bt_make_embedder_st("BAAI/bge-large-en-v1.5")
embeddings <- bt_do_embedding(embedder,
adjs$key,
accelerator = "mps")
reducer <- bt_make_reducer_umap(metric = "cosine")
reduced_embeddings <- bt_do_reducing(reducer, embeddings)
UMAP(angular_rp_forest=True, low_memory=False, metric='cosine', min_dist=0.0, n_components=5, random_state=42, verbose=True)
Wed Jul 24 12:47:40 2024 Construct fuzzy simplicial set
Wed Jul 24 12:47:40 2024 Finding Nearest Neighbors
Wed Jul 24 12:47:41 2024 Finished Nearest Neighbor Search
Wed Jul 24 12:47:42 2024 Construct embedding
Wed Jul 24 12:47:43 2024 Finished embedding
I set a min_cluster_size based on the number of embedded adjectives to attempt to keep some consistency in cluster number. This is not something that has to be done and could be left to your discretion. I also switched between cluster_selection_method being “leaf” or “eom” on a case-by-case basis.
min_cluster_size <- round(nrow(adjs)/15)
clusterer <- bt_make_clusterer_hdbscan(
min_cluster_size = min_cluster_size,
min_samples = 3L,
cluster_selection_method = "eom")
clusters <- bt_do_clustering(clusterer, reduced_embeddings)
table(clusters)
clusters
-1 0 1 2
4 20 6 3
Note that I am using a sample dataset of 200 posts here; we didn’t present, or even obtain, clusters this small in the project.
You could at this point create a Bertopic model with the clusters and use the bt_representation_openai function (I would suggest using a custom prompt) to title your clusters. I did not find this particularly helpful.
It can be helpful to look at a UMAP of the clusters before deciding if you are happy with them.
reducer2d <- bt_make_reducer_umap(n_components = 2L)
reduced_embeddings2d <- bt_do_reducing(reducer2d, embeddings)
UMAP(low_memory=False, min_dist=0.0, random_state=42, verbose=True)
Wed Jul 24 12:47:43 2024 Construct fuzzy simplicial set
Wed Jul 24 12:47:43 2024 Finding Nearest Neighbors
Wed Jul 24 12:47:43 2024 Finished Nearest Neighbor Search
Wed Jul 24 12:47:43 2024 Construct embedding
Wed Jul 24 12:47:43 2024 Finished embedding
plot_df <- data.frame(
key = adjs$key,
cluster = as.factor(clusters),
V1 = reduced_embeddings2d[,1],
V2 = reduced_embeddings2d[,2])
# filter(cluster != -1)
plot_df %>%
plotly::plot_ly(x = ~V1, y = ~V2,
type = "scatter", mode = "markers",
color = ~cluster,
text = ~key,
hoverinfo = "text") If we are happy with the clustsers, we can then visualise the top terms in each cluster.
You can find the function top_adj_grouped in the 9.functions.R file in the 758 directory.
Note that the title argument is for you to enter your own cluster titles as you see fit.
top_adj_grouped_all(df = adjs,
clusters = clusters,
title = as.character(rep(1:n_distinct(clusters))),
top_n = 5, # we would realistically show more for a delivery
colour = "blue")min_freq
I didn’t build a min_freq threshold into the clustering step or the top terms function; however, we only presented a subset of the clusters and, largely speaking, presented the clusters with the highest number of adjectives.
It is difficult to say whether or not it is appropriate to build a min_freq threshold into the clustering step, as the purpose of that step is to capture words and semantics that might otherwise be missed due to low volume, but it is probably wise to have some sort of minimum number of total words per cluster (one possible guard-rail is sketched below).
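A sketch of such a guard-rail: tally the total adjective volume per cluster and drop any cluster under a chosen threshold before presenting (the threshold of 10 here is an arbitrary illustration).
cluster_sizes <- dplyr::tibble(key = adjs$key,
                               freq = adjs$freq,
                               cluster = as.factor(clusters)) %>%
  filter(cluster != "-1") %>%   # ignore HDBSCAN outliers
  group_by(cluster) %>%
  summarise(total_words = sum(freq))

keep_clusters <- cluster_sizes %>%
  filter(total_words >= 10) %>% # arbitrary minimum total word count
  pull(cluster)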
Anyway… something you could think about!
Note that I did not actually present each cluster, and chose the clusters to present based on the story we were telling and the number of adjectives in the cluster. The function top_adj_grouped in the 9.functions.R file allows you to manually select clusters to display. This is where this step needs some thought -
- Should we be presenting only a subset of the clusters?
- Should we be trying to pull clusters with standard semantic meanings from each group?
- Should we be reverting to measuring sentiment altogether (that said, only measuring sentiment as it appears in posts where the product of interest is the subject)?
- Is there a better way of doing this?
- It could be interesting to see if we could train a clustering model that would form 3 or 4 standard clusters so that comparison across groups could be standardised.
4.4 Term Sentiment Network
The final viz I made for each specific product was a term network grouped by sentiment. This is fairly straightforward and plucked straight from the ParseR package. We did toy with looking at just adjectives, or just nouns, or the conversation as a whole, and settled on the conversation as a whole, but this is something that could be played around with provided it is consistent across products.
Since we are going to make a term viz, it might be beneficial to use lemmatised words. Be aware that models do not always lemmatise correctly and you should keep an eye out for anything that looks like it’s not right.
As you can see below, the product “Bing” is often lemmatised to “be”, and this is a correction we should hard code if we are using lemmatised words.
annotations_merge %>%
filter(tolower(token) == "bing") %>%
group_by(lemma) %>%
summarise(n = n()) %>%
collect()
# A tibble: 3 × 2
lemma n
<chr> <int>
1 be 12
2 Bing 4
3 bing 1
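A sketch of that kind of hard-coded correction, applied here to annotations_merge (the same fix would apply to the annotations table):
annotations_merge <- annotations_merge %>%
  collect() %>%
  mutate(lemma = dplyr::if_else(tolower(token) == "bing" & lemma == "be",
                                "bing", lemma)) %>%
  as_arrow_table()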
Conveniently, our pos tagging has already lemmatised our posts. We can rejoin those lemmatised words into their original post.
annotations_lemma <- annotations_merge %>%
group_by(universal_message_id, sentence_id) %>%
mutate(sentence_id = as.numeric(sentence_id),
token_id = as.numeric(token_id)) %>%
arrange(universal_message_id, sentence_id, token_id) %>%
collect() %>%
summarise(message_lemma = paste(lemma, collapse = " ")) %>%
ungroup() %>%
as_arrow_table()
In order to group the data by sentiment, we need to join the tagged data back to the original data.
sentiment_data <- data %>%
select(universal_message_id, sentiment)
data_joined <- annotations_merge %>%
filter(tolower(token) == "copilot",
dependency_tag == "nsubj") %>% # filter to where Copilot is the subject
select(universal_message_id, sentence) %>%
distinct(universal_message_id, sentence) %>% # only one row per sentence
left_join(sentiment_data, by = "universal_message_id")
One thing to note is that we haven’t yet removed stop words from the data, and this is something we will need to do before creating a sentiment term network.
Data Size
Depending on the size of the dataset, this might be somewhere you could use furrr to remove stop words in parallel. See the parallel processing section for an example of how to do this.
data_joined <- data_joined %>%
collect() %>%
mutate(sentence_clean = tm::removeWords(sentence, tm::stopwords("en"))) %>%
as_arrow_table()
When making the chart there are some style options we can set; you can see the ParseR documentation for more detail.
selected_terms <- c("NEGATIVE", "POSITIVE", "NEUTRAL",
"search", "windows", "msft", "microsoft",
"google", "github", "chatgpt", "copilot",
"gpt")
sentiment_colours <- c("NEGATIVE" = "#8b0000",
"NEUTRAL" = "grey45",
"POSITIVE" = "#008b00")
set.seed(12)
data_joined %>%
collect() %>%
ParseR::viz_group_terms_network(group_var = sentiment,
text_var = sentence_clean,
n_terms = 20,
text_size = 4,
with_ties = TRUE,
group_colour_map = sentiment_colours,
terms_colour = "grey50",
selected_terms = selected_terms,
selected_terms_colour = "black")
I did look at creating this based on correlation but found that it was bringing less frequent, and sometimes spammy or irrelevant, terms to the forefront. For example, there is an Excel scooter which was obviously brought in within the Excel query and was not removed during my cleaning steps, so, while it was mentioned in only a small number of posts, there were various scooter-related terms coming out in the correlation chart.
That said, I would be open to having another look at this or at other statistical measures we could apply to a term network.