Code
library(dplyr)
library(BertopicR)
library(ggplot2)Warning: package 'ggplot2' was built under R version 4.3.1
Code
library(stringr)
library(SegmentR)
set.seed(12)Something we have long suspected as a data science department is that post length could play a role in topic assignment during topic modelling. We frequently perform topic modelling with posts that range from short tweets to long reddit and forum posts. Do longer posts tend towards a particular topic? Do shorter posts? It makes sense that any post, but particularly longer posts, could touch on multiple topics and so it begs the question, are posts (particularly longer posts) being accurately summarised? Should we be tokenising posts into sentences / paragraphs / particular character counts so that all posts are of similar length?
This is a rather rambling list of things we are unsure about and will aim to answer, or at least interrogate, in this document.
We’ll first perform topic modelling using our current pipeline and how see how post length varies from topic to topic and then we will look at tokenising posts and how that changes the results.
There are two principle packages we use to perform topic modelling, SegmentR and BertopicR. We will perform topic modelling using both tools and look at how posts of varying length are distributed across topics in each.
Note: Unless otherwise stated, all embedding (all-mpnet-base-v2) and dimension reduction (umap) was performed on colab using the V100 gpu. This was done in the interest of computation time.
library(dplyr)
library(BertopicR)
library(ggplot2)Warning: package 'ggplot2' was built under R version 4.3.1
library(stringr)
library(SegmentR)
set.seed(12)First let’s load in the raw data and look at the variation of text length.
# load the data
data <- readr::read_csv("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/My Drive/data_science_project_work/microsoft/project_work/688_ai_landscape_q1_fy24/data/clean_data/688_sampled_data_250k.csv") %>%
mutate(text_bert = text) %>%
ParseR::clean_text(text_var = text_bert,
tolower = FALSE,
hashtags = FALSE,
mentions = TRUE,
punctuation = TRUE,
emojis = FALSE,
digits = TRUE) %>%
mutate(char_length = stringr::str_length(text_bert)) data %>%
ggplot(aes(y = char_length)) +
geom_boxplot() +
labs(x = "Character Length") +
theme_minimal() There are almost no posts with over 2000 characters and after looking att them, they are all spam. Let’s remove these and load in our precalculated embeddings (which exclude these posts).
data <- data %>%
filter(char_length <= 2000)
reduced_embeddings <- readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/full_post_data_reduced_embeddings_V100gpu.rds")Now let’s look at how we would cluster
# make the clusterer
clusterer <- bt_make_clusterer_hdbscan(min_cluster_size = 1000L, min_samples = 5L, cluster_selection_method = "leaf")
# how many clusters would this make?
clusters <- bt_do_clustering(clusterer, reduced_embeddings)
table(clusters)clusters
-1 0 1 2 3 4 5 6 7 8 9
139174 7795 5683 4872 4806 4498 4227 4122 3579 3388 2915
10 11 12 13 14 15 16 17 18 19 20
2905 2745 2734 2611 2587 2568 2420 2268 2246 2217 2214
21 22 23 24 25 26 27 28 29 30 31
2095 1990 1972 1946 1927 1895 1817 1817 1682 1659 1584
32 33 34 35 36 37 38 39 40 41 42
1574 1483 1477 1334 1241 1230 1223 1223 1093 1065 1059
43 44
1036 1014
45 topics looks like a some-what manageable number, let’s go with that. Let’s fit the model.
model <- bt_compile_model(embedding_model = bt_empty_embedder(), # skip embedding step
reduction_model = bt_empty_reducer(), # skip reducing
clustering_model = clusterer) # perform clustering again
bt_fit_model(model, data$text_bert, reduced_embeddings)model$get_topic_info() %>% select(-Representative_Docs) %>%
DT::datatable(options = list(pageLength = 5))data <- data %>%
mutate(bert_topic = model$get_document_info(data$text_bert)$Topic) But how do posts with different lengths distribute across the topics? Let’s first look at the distribution of post length per topic.
data %>%
mutate(bert_topic = as.factor(bert_topic)) %>%
ggplot(aes(x = bert_topic, y = char_length)) +
geom_boxplot() +
labs(x = "Topic", y = "Character Length") +
theme_minimal() It looks like some topics contain posts with quite a low post length, eg. topic 8, 10 and 35, while some topics contain posts with a generally higher post length.
Let’s try to put some numbers on this.
central_tendency_bert <- data %>%
group_by(bert_topic) %>%
summarise(mean_length = mean(char_length),
median_length = median(char_length),
variance = var(char_length),
std = sd(char_length))
central_tendency_bert %>%
tidyr::gather(measure, result, -bert_topic) %>%
ggplot(aes(x = result)) +
geom_histogram() +
facet_wrap(~ measure, scales = "free") +
geom_histogram(data = subset(central_tendency_bert %>%
tidyr::gather(measure, result, -bert_topic), measure == "variance"),
aes(x = result), binwidth = 1000) +
labs(title = "Topics chosen using BertopicR")`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
While the bulk of the data has a mean post length around 200 and a median between 100 and 200, there does seem to be a high range of variance in the data and there are definitely some topics that look like they have an outlier means and medians. There also seems to be one topic with a 0 variance, this wouldn’t make any sense, and that’s because it’s not 0 at all! The binwidth is automatically selected by the geom_histogram function and the range of the data simply means that it has a very large binwidth that encompasses 0.
central_tendency_bert %>%
arrange(variance) %>%
head(5)# A tibble: 5 × 5
bert_topic mean_length median_length variance std
<dbl> <dbl> <dbl> <dbl> <dbl>
1 8 77.2 68 1484. 38.5
2 3 84.4 67 3958. 62.9
3 35 72.4 52.5 5150. 71.8
4 10 103. 78 7848. 88.6
5 9 139. 106 12414. 111.
The actual variance is 1484 and refers to topic 8. Maybe we should look at topic 8 a bit closer.
data %>%
filter(bert_topic == 8) %>%
select(text_bert) %>% sample_n(10) %>%
DT::datatable(options = list(pageLength = 5))All looks ok, the topic seems to be about AI, maybe AI as referred to in a slightly negative or fearful way? It is clear that all of these posts are short and generally seem to be replies to something else. I wonder is it grouping based on mentions?
data %>%
filter(bert_topic == 8, str_detect(text_bert, "^@")) %>% nrow()[1] 3331
data %>% filter(bert_topic == 8) %>% nrow()[1] 3388
98% of these posts begin with a mention! If we removed mentions from the data before embedding, would this topic disappear? This is something I might look at more later.
But back to the stats… We’ll perform an anova across topics to see if there is a statistical difference between groups.
aov_df <- data %>%
mutate(bert_topic = as.factor(bert_topic))
summary(aov(char_length ~ bert_topic, data = aov_df)) Df Sum Sq Mean Sq F value Pr(>F)
bert_topic 45 4.954e+08 11009032 354.3 <2e-16 ***
Residuals 248964 7.736e+09 31072
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Definitely significant…
If we ignore the outlier category?
aov_df <- data %>%
filter(bert_topic != -1) %>%
mutate(bert_topic = as.factor(bert_topic))
summary(aov(char_length ~ bert_topic, data = aov_df, var.equal = FALSE))Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
extra argument 'var.equal' will be disregarded
Df Sum Sq Mean Sq F value Pr(>F)
bert_topic 44 4.634e+08 10532595 295.5 <2e-16 ***
Residuals 109791 3.914e+09 35646
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Still significant, the outlier topic isn’t causing this.
What does this mean? It is possible that messages are being pulled from from forums where there are long threads (of long messages) all replying to and discussing the same thing and thus belong in the same topic, and that shorter posts are being grouped together because they’re a bit more ambiguous and make a topic of “AI mentions we can’t classify” or something like that. It is also possible that the length of posts is affecting their embeddings and thus the clusters they are assigned to. Let’s see does LDA give similar results.
For consistency, I am choosing the number of topics to be the same as the number of topics found with BertopicR.
library(SegmentR)
dtms <- data %>%
make_DTMs(text_var = text_bert,
url_var = permalink,
clean_text = TRUE,
remove_stops = TRUE)
ldas <- dtms %>% SegmentR::fit_LDAs(k_opts = length(unique(data$bert_topic)) - 1)
explore <- ldas %>% SegmentR::explore_LDAs()probs <- explore %>% purrr::pluck("probabilities", 1)
linked_probabilities <- SegmentR::topics_link(data, probs)
topics_linked <- linked_probabilities %>%
SegmentR::topics_classify(topic_cutoff = 0.1)I could put a lot more thought into choosing a probability threshold for each post to be categorised into a particular topic but for simplicity of this analysis, I’m going to set it to 0.1.
Some posts won’t meet the threshold to be classified as a particular topic and I am going to align with the Bertopic convention and label these posts as -1.
data_lda <- data %>%
left_join(topics_linked) %>%
rename(lda_topic = topic) %>%
mutate(lda_topic = stringr::str_replace_all(lda_topic, "topic_", ""),
lda_topic = ifelse(is.na(lda_topic), -1, lda_topic)) Joining with `by = join_by(text, clean_text, date, sender_screen_name,
conversation_id, universal_message_id, permalink, sentiment, message_type,
parent_universal_message_id, social_network, doc_id, text_bert, char_length,
bert_topic)`
Let’s look at how similar lda and bert topics are using sklearn’s adjusted rand index.
sklearn <- reticulate::import("sklearn")
sklearn$metrics$adjusted_rand_score(data_lda$bert_topic, data_lda$lda_topic)[1] 0.001816008
The adjusted rand score can fall from -1 to +1 with -1 being very dissimilar, +1, the same, and 0, while in the middle, would equate to agreement due to randomness. A score of 0.002 indicates that our topics are not similar at all and that any similarity between topics is likely down to chance. We already knew that LDA and BERTopic would likely result in different topic groupings due to the difference in how they identify topics, this serves to further confirm it.
Now let’s look at if post length is playing a role in topic assignment for LDA topics.
data_lda %>%
mutate(lda_topic = as.factor(sort(as.numeric((lda_topic))))) %>%
ggplot(aes(x = lda_topic, y = char_length)) +
geom_boxplot() +
labs(x = "Topic", y = "Character Length") +
theme_minimal() These means and interquartile ranges look a lot more similar here, there is however variance in the number of posts with a high character length in each group. Let’s look at some central tendency measures again to get a better idea of this.
central_tendency_lda <- data_lda %>%
group_by(lda_topic) %>%
summarise(mean_length = mean(char_length),
median_length = median(char_length),
variance = var(char_length),
std = sd(char_length))
central_tendency_lda %>%
tidyr::gather(measure, result, -lda_topic) %>%
ggplot(aes(x = result)) +
geom_histogram() +
facet_wrap(~ measure, scales = "free") +
labs(title = "Topics chosen using SegmentR")`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This does look less variable that the BertopicR results, all measures have a tighter distribution.
aov_lda <- data_lda %>%
filter(lda_topic != -1) %>%
mutate(lda_topic = as.factor(lda_topic))
summary(aov(char_length ~ lda_topic, data = aov_lda)) Df Sum Sq Mean Sq F value Pr(>F)
lda_topic 44 8.392e+08 19073491 499.2 <2e-16 ***
Residuals 637545 2.436e+10 38208
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looks like there is a significant difference between the post length in different topics. Let’s have a look at how tokenising texts into sentences might work and we can have more of a qualitative comparison after that.
First we tokenise the data into sentences and then we’ll have a quick look at the post length distribution.
data_tokenised <- data %>%
tidytext::unnest_tokens(output = sentences,
input = text_bert,
token = "sentences",
drop = FALSE,
to_lower = FALSE) %>%
mutate(char_length_token = stringr::str_length(sentences))
data_tokenised %>%
ggplot(aes(y = char_length_token)) +
geom_boxplot()There are still some longer sentences, likely due to lack of punctuation in the original post, but the majority of documents are now less than 250 characters which is good, we have a relatively small range of sentence length. Let’s see if sentences fall into similar topics to their overall post.
I am going to focus on BertopicR here and forget LDAs for now, again, loading in precalculated embeddings.
reduced_token_embeddings <-
readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/rds_tokenised_data_reduced_embeddings_rev2_V100gpu.rds")Let’s make a new clusterer with a larger min_cluster_size. The reason for this is so that we generate a similar number of topics.
clusterer_tokenised <- bt_make_clusterer_hdbscan(min_cluster_size = 1900L, cluster_selection_method = "leaf")
tokenised_model <- bt_compile_model(embedding_model = bt_empty_embedder(),
reduction_model = bt_empty_reducer(),
clustering_model = clusterer_tokenised)
bt_fit_model(tokenised_model, data_tokenised$sentences, reduced_token_embeddings)tokenised_model$get_topic_info() %>%
select(-Representative_Docs) %>%
DT::datatable(options = list(pageLength = 5))45 topics here and 45 topics in the original analysis! We have no way to know to what extent these topics match up without doing a more qualitative analysis which we’ll get to.
First let’s look at how this affects outliers:
data_joined <- data_tokenised %>%
rename(char_length_post = char_length) %>%
mutate(tokenised_topics = tokenised_model$topics_)
data_joined %>%
filter(tokenised_topics == -1) %>% nrow()/nrow(data_joined)[1] 0.6515969
data %>%
filter(bert_topic == -1) %>% nrow()/nrow(data)[1] 0.5589093
So when we tokenise the data into sentences ~65% of the sentences are classified as outliers while ~56% of posts are classified as outliers when we perform topic modelling at a post level. It might make sense that there are more outliers when looking at individual sentences as there are likely lots of filler sentences that don’t necessarily mean anything.
If we disregard outliers for now, how many documents have their tokenised sentences grouped in different topics.
data_joined %>%
filter(tokenised_topics != -1) %>%
pull(doc_id) %>% unique() %>% length()[1] 133469
data_joined %>%
filter(tokenised_topics != -1) %>%
group_by(doc_id) %>%
mutate(num_topics_per_doc = n_distinct(tokenised_topics)) %>% select(doc_id, tokenised_topics, bert_topic, num_topics_per_doc) %>%
filter(num_topics_per_doc > 1) %>% pull(doc_id) %>% unique() %>% length()[1] 27068
27068 posts have their tokenised sentences assigned to more than 1 group, so we could assume that 27068 posts contain more than 1 topic. This is about 20% of all posts (excluding outliers).
Before look at how the sentence tokenised topics match with post level topics, let’s have a quick look at how sentences of different lengths are distributed across topics.
data_joined %>%
mutate(tokenised_topics = as.factor(tokenised_topics)) %>%
ggplot(aes(x = tokenised_topics, y = char_length_token)) +
geom_boxplot() +
labs(x = "Topic", y = "Character Length") +
theme_minimal() Again, there are some topics (3, 9, 19, 23…) that seem to have a much lower sentence length distribution than other topics.
central_tendency_tokenised <- data_joined %>%
group_by(tokenised_topics) %>%
summarise(mean_length = mean(char_length_token),
median_length = median(char_length_token),
variance = var(char_length_token),
std = sd(char_length_token))
central_tendency_tokenised %>%
tidyr::gather(measure, result, -tokenised_topics) %>%
ggplot(aes(x = result)) +
geom_histogram() +
facet_wrap(~ measure, scales = "free") +
labs(title = "Topics chosen using SegmentR")`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(aov(tokenised_topics ~ char_length_token, data_joined)) Df Sum Sq Mean Sq F value Pr(>F)
char_length_token 1 7 7.29 0.065 0.799
Residuals 569462 64361604 113.02
Mean and median token length have a much lower range and a tighter distribution than when we looked at overall post length, however we can see outliers on the lower end of the distribution. Unsurprisingly, tokenising into sentences seems to have removed outliers on the upper end of the distribution. The anova did not find a significant difference between character length in different topics.
It looks like tokenising into sentences might have removed or at least reduced the affect of post length on topic assignment. Let’s have a look at the actual results we get by tokenising into sentences - is it worth doing?
There’s a few things we can look at here: - Are posts that were in the outlier topic, now getting sentences put into another topic? - Are posts that were classified having their individual sentences put in the outlier category? - Are sentences from certain topics tending to go to the same topic after tokenisation into sentences?
We can create a contingency table to see the topics tokenised sentences would be assigned to based on whether topic modelling is performed at a post or sentence token level.
contingency_table <- table(data_joined$tokenised_topics, data_joined$bert_topic)
# contingency_table[1:8,]
contingency_table %>%
as.data.frame() %>%
rename(tokenised = Var1, post_level = Var2) %>%
group_by(tokenised) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Contingency Table Heatmap", x = "Sentence Tokenised Topic", y = "Post Level Topic")The rows in the heatmap above correspond to the sentence tokenised topic and the columns correspond to the topic the sentences would have been classified as in the post level topic modelling.
One big takeaway is that most outliers, remain outliers.
data_joined %>%
filter(bert_topic == -1, tokenised_topics == -1) %>% distinct(doc_id) %>% nrow()[1] 120860
post_outlier_not_token_outlier <- data_joined %>%
filter(bert_topic == -1, tokenised_topics != -1) %>%
distinct(doc_id) %>% pull(doc_id)
data_joined %>%
filter(bert_topic == -1, tokenised_topics == -1) %>%
anti_join(data.frame(doc_id = post_outlier_not_token_outlier), by = "doc_id") %>%
distinct(doc_id) %>% nrow()[1] 81053
120860 of outlier posts (~87%) have at least one tokenised sentence that remains an outlier after sentence token level topic modelling while 81053 (~58%) of outlier posts remain entirely as outliers after sentence token level topic modelling.
If we look beyond the outliers, a lot of the sentence token level topics seems to correspond to one or more post level topics, eg. topic 1 at the sentence token level seems to correspond relatively well with topic 2 at the post level and topic 3 at the sentence token level corresponds best with topic 8 at the post level.
Let’s consider the sentence token level topic 2, the majority of sentences from this topic would have been classified as either an outlier or part of topic 8 at the post level. If we look first at how posts in topic 8 correspond to sentences in token level topic 2.
data_joined %>%
filter(tokenised_topics == 2, bert_topic == 3) %>%
sample_n(10) %>%
select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))This all appears to be about AI and art. What about the sentences that would have been classified as an outlier at the post level?
data_joined %>%
filter(tokenised_topics == 2, bert_topic == -1) %>%
sample_n(10) %>%
select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))In some instances, particularly where the original post is the same as the sentence tokenised document, it is difficult to see why that post / sentence would have been classified as an outlier document in the post level topic model but as part of topic 2 in the sentence token level topic model. In other cases, particularly with longer posts, it is clear that the relevant sentences from the overall post are classified in the relevant sentence token level topic, ie. the art specific sentences from longer posts are classified as topic 2 at the sentence token level.
Why would some posts that are only 1 sentence long to being with (ie. the same as the sentence tokenised document) fall into an outlier topic at post level topic modelling but fall into a specific topic after sentence token level topic modelling? It could be possible that creating topics based on text tokenised into sentences is creating more focused topics without the “fluff” of irrelevant sentences in posts. This could be why some posts that didn’t fall into the “art” topic in the post level modelling may fall into the “art” topic in sentence token level topic modelling.
What if we look at some of the smaller topics.
tail(contingency_table, 8) %>%
as.data.frame() %>%
rename(tokenised = Var1, post_level = Var2) %>%
group_by(tokenised) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Contingency Table Heatmap", x = "Sentence Tokenised Topic", y = "Post Level Topic")Some of these sentence token level topics look to be mainly outliers in the post level topic model. Let’s look at topic 44.
data_joined %>%
filter(tokenised_topics == 44) %>%
sample_n(10) %>%
select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))This looks like an example of where tokenisation might result in a meaningless topic. This appears to be a topic of mainly irrelevant sentences from what otherwise could be a post with useful information.
Finally let’s look at a post level topic had a generally higher character length.
central_tendency_bert %>% arrange(desc(median_length))# A tibble: 46 × 5
bert_topic mean_length median_length variance std
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 377. 264. 108491. 329.
2 29 286. 235 60251. 245.
3 41 256. 214 53384. 231.
4 20 259. 209 60605. 246.
5 14 282. 207 77311. 278.
6 43 217. 204 29247. 171.
7 31 201. 200. 13485. 116.
8 21 232. 198 39700. 199.
9 42 200. 196 18974. 138.
10 6 221. 194 36288. 190.
# ℹ 36 more rows
Topic 2 at the post level (not the same topic 2 we just looked at which was topic 2 at sentence token level) has the highest median character length. Let’s look at how it’s sentences get classified after tokenisation
contingency_table[,4, drop = FALSE] %>%
as.data.frame() %>%
rename(tokenised = Var1, post_level = Var2) %>%
# group_by(tokenised) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Contingency Table Heatmap", x = "Sentence Tokenised Topic", y = "Post Level Topic")The majority of sentences from this topic become outliers at the sentence token level and while sentences are distributed across different topics, there is a clear connection to topic 1 at the sentence token level.
data_joined %>%
filter(bert_topic == 2, tokenised_topics == 1) %>%
sample_n(10) %>%
select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))data_joined %>%
filter(bert_topic == 2, tokenised_topics == -1) %>%
sample_n(10) %>%
select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))This topic seem to be about gaming and it would seem that some of the sentences that are classified as an outlier at the token level could make sense in the actual gaming topic.
We can compare the two methods of topic classifications using the adjusted rand index from the python sklearn package.
bert_topic <- data_joined$bert_topic
tokenised_topic <- data_joined$tokenised_topics
sklearn$metrics$adjusted_rand_score(bert_topic, tokenised_topic)[1] 0.142766
0.14 would indicate that our clusters aren’t that well aligned across analysis and is a numerical representation of the fact that there is a difference in the clusters we obtain by modelling of data that has been tokenised into sentences to the clusters we obtain by clustering on raw posts.
During this analysis I accidentally embedded and reduced those embeddings on lowercase sentence tokenised data and said I’d take this opportunity to look at what difference this might make. Let this be a warning - unnest_tokens defaults to lowercasing tokenised text.
data_lowercase_tokenised <- data %>%
tidytext::unnest_tokens(output = sentences,
input = text_bert,
token = "sentences",
drop = FALSE) %>%
mutate(char_length_token = stringr::str_length(sentences))
lowercase_reduced_token_embeddings <-
readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/tokenised_data_reduced_embeddings_posts_over_2000_chars_removed_V100gpu.rds")To what extent to the embeddings change?
reduced_token_embeddings[1:5]# A tibble: 569,464 × 5
`0` `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 8.82 4.87 4.34 2.65 5.12
2 11.4 7.61 4.57 2.31 4.19
3 9.44 5.60 4.60 4.20 5.92
4 10.3 7.78 3.59 2.12 4.06
5 11.1 9.56 5.63 1.89 5.76
6 10.7 5.80 3.20 4.19 6.07
7 10.8 5.15 4.04 3.05 4.49
8 11.3 7.92 4.36 2.71 4.25
9 11.1 7.60 4.52 3.17 4.23
10 11.0 8.20 3.88 2.54 4.08
# ℹ 569,454 more rows
lowercase_reduced_token_embeddings[1:5]# A tibble: 569,464 × 5
`0` `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 6.61 5.58 4.96 4.18 4.53
2 9.43 5.85 5.02 3.78 4.57
3 7.01 4.50 4.07 4.55 3.62
4 9.07 6.27 4.60 4.34 5.47
5 9.41 6.83 4.65 2.71 2.68
6 7.86 3.60 2.93 4.65 4.54
7 7.97 4.67 4.64 4.10 5.25
8 9.62 5.71 4.77 4.10 4.60
9 9.39 5.32 4.77 4.59 4.66
10 9.64 5.97 4.52 4.43 4.97
# ℹ 569,454 more rows
Let’s use the same clusterer as we used for the original sentence tokenised data and see if it changes anything here.
lowercase_tokenised_model <- bt_compile_model(embedding_model = bt_empty_embedder(),
reduction_model = bt_empty_reducer(),
clustering_model = clusterer_tokenised)
bt_fit_model(lowercase_tokenised_model, data_lowercase_tokenised$sentences, lowercase_reduced_token_embeddings)tokenised_model$get_topic_info() %>%
select(-Representative_Docs) %>%
DT::datatable(options = list(pageLength = 5))45 topics here and 45 topics in the original sentence tokenised analysis, this is a good start! We should look at how well these topics match up.
data_lowercase_joined <- data_joined %>%
mutate(tokenised_lowercase_topics = lowercase_tokenised_model$topics_) First let’s look at how this affects outliers:
lowercase_contingency_table <- table(data_lowercase_joined$tokenised_topics, data_lowercase_joined$tokenised_lowercase_topics)
lowercase_contingency_table[1:8,] %>%
as.data.frame() %>%
rename(tokenised = Var1, post_level = Var2) %>%
group_by(tokenised) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Contingency Table Heatmap", x = "Lowercase Text, Sentence Tokenised Topic", y = "Standard Text, Sentence Tokenised Topic")Here the rows represent the original sentence tokenised topics and the columns represent the lowercase topics. Looking at both the outlier row and column it is clear that there is disagreement of outlier tokens in both directions. However, disregarding outliers, there is a generally high agreement of topics which tokens belong to.
If we take the first few columns for the original sentence tokenised topic 6:
lowercase_contingency_table[8, 1:12] -1 0 1 2 3 4 5 6 7 8 9 10
2147 11 1 0 0 0 1 1 5 3 4958 1
4958 (~69%) of this topic correlates to topic 9 of the lowercase topics, the other 30% of topic 6 becomes an outlier when topic modelling with the lowercase data. This is a trend that holds across the data. The alignment of topics across the two dataset does result in one question - why is the only consistent disagreement between the two datasets in relation to outlier classification? This really highlights how careful we should be when dealing with outliers in topic modelling.
Again we can look at the adjusted rand index.
lowercase_tokenised_topic <- data_lowercase_joined$tokenised_lowercase_topics
sklearn$metrics$adjusted_rand_score(tokenised_topic, lowercase_tokenised_topic)[1] 0.5775573
Here we get an an adjusted rand score of 0.58. The outlier disagreement between clustering methods is likely the main contributor to this score not being closer to 1.
It was apparent after post level topic modelling that at least one topic (topic 8) seemed to be classified based on posts that began with a mention. We’ll look now at if this is the case across other topics and what happens if we remove the mention.
First let’s see if this is the case for any other topic. We’ll disregard outliers for this and arbitrarily set the threshold for posts per topic beginning with a mention as 80%.
data %>%
group_by(bert_topic) %>%
mutate(topic_count = n()) %>%
filter(str_detect(text_bert, "^@"), bert_topic != -1) %>%
mutate(mention_topic_count = n(),
mention_prop = mention_topic_count/topic_count) %>%
filter(mention_prop > 0.8) %>% distinct(bert_topic)# A tibble: 2 × 1
# Groups: bert_topic [2]
bert_topic
<dbl>
1 3
2 8
We had already identified topic 8 in the above analysis. It looks like topic 3 could also be grouped based on mentions. If you remember, topic 3 was made up of abnormally short posts… Let’s look at topic 3 and see what it is about.
data %>%
filter(bert_topic == 3) %>%
sample_n(10) %>%
select(text_bert) %>% DT::datatable(options = list(pageLength = 5))This seems to be a topic based on people complimenting others. I wonder if it would still be a topic if mentions are removed?
no_mention_reduced_embeddings <- readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/rds_reduced_embeddings_no_mentions_V100gpu.rds")data_no_mentions <- data %>%
mutate(text_no_mentions = text_bert,
text_no_mentions = str_remove_all(text_no_mentions, "\\B@\\w+")) %>%
LimpiaR::limpiar_spaces(text_no_mentions)
no_mentions_model <- bt_compile_model(embedding_model = bt_empty_embedder(), # skip embedding step
reduction_model = bt_empty_reducer(), # skip reducing
clustering_model = clusterer) # perform clustering again
bt_fit_model(no_mentions_model, data_no_mentions$text_no_mentions, no_mention_reduced_embeddings)no_mentions_model$get_topic_info() %>% select(-Representative_Docs) %>%
DT::datatable(options = list(pageLength = 5))52 topics this time instead of 45, removing mentions obviously changes things.
Let’s look at what happens to to topic 3 and 8 which appeared to be grouped based on mentions.
data_no_mentions <- data_no_mentions %>%
mutate(no_mention_topic = no_mentions_model$topics_)
no_mention_contingency <- table(data_no_mentions$bert_topic,
data_no_mentions$no_mention_topic)
no_mention_contingency[c(5,10),] %>% t()
3 8
-1 3174 2521
0 14 20
1 7 2
2 5 6
3 16 7
4 1 3
5 1 1
6 1 20
7 33 5
8 2 1
9 2 336
10 7 1
11 18 5
12 6 3
13 262 4
14 11 10
15 0 3
16 3 7
17 131 14
18 0 0
19 6 0
20 0 1
21 1 5
22 2 0
23 2 1
24 0 159
25 0 5
26 37 1
27 1 2
28 3 2
29 0 2
30 0 0
31 0 3
32 949 1
33 3 33
34 1 2
35 4 4
36 6 1
37 1 1
38 1 0
39 30 0
40 1 2
41 12 4
42 2 7
43 2 142
44 1 23
45 0 0
46 1 1
47 1 4
48 37 0
49 5 6
50 1 1
51 2 6
They both become mainly outliers! Topic 3 does have 949 appearances in the no mention topic 32, let’s look at that.
data_no_mentions %>%
filter(no_mention_topic == 32, bert_topic == 3) %>%
sample_n(10) %>%
select(text_bert, text_no_mentions) %>% DT::datatable(options = list(pageLength = 5))This seems to all people congratulating others and is probably a fair topic. Looks like by removing mentions has resulted in only the posts that should be grouped together being grouped together.
What affect does this have on the rest of the data?
no_mention_contingency[1:20,] %>%
as.data.frame() %>%
rename(tokenised = Var1, post_level = Var2) %>%
group_by(tokenised) %>%
mutate(prop = Freq/sum(Freq)) %>%
ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Contingency Table Heatmap", x = "Original Post Level Topic", y = "Mentions removed Post Level Topic")Some of the original topics seem to split across multiple topics in this analysis (eg. topic 0 and topic 2), but for the most part the original topics map to only one topic in this analysis. It looks like, similar to when we perform the analysis in lower case, the only place there is high levels of disagreement is when it comes to outliers. Another indicator that outliers are hard and we need to be careful when messing with them!
Again we can look at the adjusted rand index.
no_mentions_topic <- data_no_mentions$no_mention_topic
bert_topic <- data_no_mentions$bert_topic
sklearn$metrics$adjusted_rand_score(bert_topic, no_mentions_topic)[1] 0.4031806
Here we get an an adjusted rand score of 0.40. The outlier disagreement and the dissolution of some topics that appeared to be grouped based on mentions are likely the main contributors to this score not being closer to 1.
Finally, is post length still affecting topics? Looks like it is.
data_no_mentions %>%
mutate(char_length_no_mention = str_length(text_no_mentions),
no_mention_topic = as.factor(no_mention_topic)) %>%
ggplot(aes(x = no_mention_topic, y = char_length_no_mention)) +
geom_boxplot() +
labs(x = "Topic", y = "Character Length") +
theme_minimal() central_tendency_no_mention <- data_no_mentions %>%
mutate(char_length_no_mention = str_length(text_no_mentions)) %>%
group_by(no_mention_topic) %>%
summarise(mean_length = mean(char_length_no_mention),
median_length = median(char_length_no_mention),
variance = var(char_length_no_mention),
std = sd(char_length_no_mention))
central_tendency_no_mention %>%
tidyr::gather(measure, result, -no_mention_topic) %>%
ggplot(aes(x = result)) +
geom_histogram() +
facet_wrap(~ measure, scales = "free") +
labs(title = "Topics chosen using BertopicR")`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(aov(no_mention_topic ~ char_length_no_mention, data_no_mentions %>%
mutate(char_length_no_mention = str_length(text_no_mentions)))) Df Sum Sq Mean Sq F value Pr(>F)
char_length_no_mention 1 95929 95929 485.6 <2e-16 ***
Residuals 249008 49195762 198
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What affect is post length having on topic modelling? The fact that across each analysis, the post length distribution across topics is so varied would suggest that it does. That said, the topics themselves appear to be valid and what is to say that they shouldn’t be grouped like this regardless of post length. When we looked at the contingency table after topic modelling at the sentence token level, it was generally the case that sentences from post level topics mapped to the same sentence token level post and to the outlier category. This would suggest that topic modelling at the post level isn’t actually a terrible generalisation. While we saw that about 20% of posts contained more than one topic, when we look at individual topics, the consesous across analyses was pretty good.
Are posts that were in the outlier topic, now getting sentences put into another topic? For the most part, yes. 42% (see above) of outlier posts have at least one tokenised sentence assigned to a specific topic after tokenisation into sentences. That said, we did see that not all topics were overly meaningful, eg. topic 44 was made up of sentences that were mainly outliers at the post level and it did not provide any useful information.
Are posts that were classified as a particular topic having their individual sentences put in the outlier category? Yes! A quick glance at the first row of our contingency_table tells us that a large number of sentences from every post level topic are being classified as outliers after sentence token level topic modelling.
Are sentences from certain topics tending to go to the same topic after tokenisation into sentences? In a lot of cases, yes eg. we looked at how topic 1 at the sentnece token level seems to correspond relatively well with topic 2 at the post level and topic 3 at the sentence token level corresponds best with topic 8 at the post level.
Should we tokenise our data into sentences? The big draw back of tokenising data for the purpose of modelling on smaller text chunks is processing time, it massively increases the dimensions of our embeddings (it more than doubled here). If we want to embed in r, this is not really feasible, if we are happy to embed using colab and load results into r, it only takes a few minutes and shouldn’t be a barrier. Tokenising data into sentences can really only improve our results by granulising them, however one question is how we would deal with posts that fall into multiple topics in terms of presenting them to a client. Another big advantage of tokenising data into sentences is that when you only have to scan one sentence, it makes it a lot easier for us as users to interpret data and understand what topics are about.
Is there different results when embedding lowercase vs standard case data? Lowercasing the data does not appear to affect the identification of clusters, but does disagree on what is classified as an outlier. It seems that there has been consistent disagreement on outliers across this analysis and without a ground truth it is tough to say what best practice would be.
Should we remove mentions? The quick analysis above would suggest that sometimes things are grouped based on mentions. This may not always be a bad things and maybe sometimes it makes sense to form a topic around a mention. Removing mentions didn’t seem to alter our results too much and so if you feel it makes sense to remove them for your dataset, it should be ok to do.
Other comments: There has been consistent disagreement on outliers across this analysis. Without a ground truth it is impossible to say definitively what should and should not be an outlier and so the biggest takeaway for me is to be conscious when performing topic modelling and changing the outlier definition.