Post Length Investigation

1 What are we trying to achieve?

Something we have long suspected as a data science department is that post length could play a role in topic assignment during topic modelling. We frequently perform topic modelling on posts that range from short tweets to long Reddit and forum posts. Do longer posts tend towards a particular topic? Do shorter posts? Any post, but particularly a longer one, could touch on multiple topics, which raises the question: are posts (particularly longer posts) being accurately summarised? Should we be tokenising posts into sentences / paragraphs / particular character counts so that all posts are of similar length?

This is a rather rambling list of things we are unsure about and will aim to answer, or at least interrogate, in this document.

We’ll first perform topic modelling using our current pipeline and see how post length varies from topic to topic, and then we will look at tokenising posts and how that changes the results.

There are two principal packages we use to perform topic modelling, SegmentR and BertopicR. We will perform topic modelling using both tools and look at how posts of varying length are distributed across topics in each.

Note: Unless otherwise stated, all embedding (all-mpnet-base-v2) and dimension reduction (UMAP) was performed on Colab using a V100 GPU, in the interest of computation time.

Code
library(dplyr)
library(BertopicR)
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.1
Code
library(stringr)
library(SegmentR)
set.seed(12)

First let’s load in the raw data and look at the variation of text length.

Code
# load the data
data <- readr::read_csv("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/My Drive/data_science_project_work/microsoft/project_work/688_ai_landscape_q1_fy24/data/clean_data/688_sampled_data_250k.csv") %>%
  mutate(text_bert = text) %>%
  ParseR::clean_text(text_var = text_bert,
                     tolower = FALSE,
                     hashtags = FALSE,
                     mentions = TRUE,
                     punctuation = TRUE,
                     emojis = FALSE,
                     digits = TRUE) %>%
  mutate(char_length = stringr::str_length(text_bert)) 
Code
data %>% 
ggplot(aes(y = char_length)) +
  geom_boxplot() +
  labs(y = "Character Length") +
  theme_minimal()  

There are almost no posts with over 2000 characters and, after looking at them, they are all spam. Let’s remove these and load in our precalculated embeddings (which exclude these posts).

Code
data <- data %>%
  filter(char_length <= 2000)

reduced_embeddings <- readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/full_post_data_reduced_embeddings_V100gpu.rds")
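
The embedding and reduction code itself isn’t shown in this document, so for reference, the precalculated reduced embeddings loaded above were generated roughly along these lines. This is a sketch only: bt_make_embedder_st() and bt_make_reducer_umap() are my assumption of the relevant BertopicR helpers, and names and arguments may differ between package versions.

Code
# sketch of the embedding + reduction step run on Colab (V100 GPU):
# embed with all-mpnet-base-v2, then reduce to 5 dimensions with UMAP,
# matching the 5 column reduced embeddings loaded above
embedder <- bt_make_embedder_st("all-mpnet-base-v2")
embeddings <- bt_do_embedding(embedder, data$text_bert)

reducer <- bt_make_reducer_umap(n_components = 5L)
reduced_embeddings <- bt_do_reducing(reducer, embeddings)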

Now let’s look at how we would cluster.

Code
# make the clusterer
clusterer <- bt_make_clusterer_hdbscan(min_cluster_size = 1000L, min_samples = 5L, cluster_selection_method = "leaf")

# how many clusters would this make?
clusters <- bt_do_clustering(clusterer, reduced_embeddings)
table(clusters)
clusters
    -1      0      1      2      3      4      5      6      7      8      9 
139174   7795   5683   4872   4806   4498   4227   4122   3579   3388   2915 
    10     11     12     13     14     15     16     17     18     19     20 
  2905   2745   2734   2611   2587   2568   2420   2268   2246   2217   2214 
    21     22     23     24     25     26     27     28     29     30     31 
  2095   1990   1972   1946   1927   1895   1817   1817   1682   1659   1584 
    32     33     34     35     36     37     38     39     40     41     42 
  1574   1483   1477   1334   1241   1230   1223   1223   1093   1065   1059 
    43     44 
  1036   1014 

45 topics (clusters 0 to 44, plus the -1 outlier bin) looks like a somewhat manageable number, let’s go with that. Let’s fit the model.

Code
model <- bt_compile_model(embedding_model = bt_empty_embedder(), # skip embedding step
                          reduction_model = bt_empty_reducer(), # skip reducing
                          clustering_model = clusterer) # perform clustering again 

bt_fit_model(model, data$text_bert, reduced_embeddings)
Code
model$get_topic_info() %>% select(-Representative_Docs) %>% 
  DT::datatable(options = list(pageLength = 5))
Code
data <- data %>%
  mutate(bert_topic = model$get_document_info(data$text_bert)$Topic) 

But how do posts with different lengths distribute across the topics? Let’s first look at the distribution of post length per topic.

Code
data %>% 
  mutate(bert_topic = as.factor(bert_topic)) %>%
ggplot(aes(x = bert_topic, y = char_length)) +
  geom_boxplot() +
  labs(x = "Topic", y = "Character Length") +
  theme_minimal()  

It looks like some topics contain posts with quite a low post length, e.g. topics 8, 10 and 35, while some topics contain posts with a generally higher post length.

Let’s try to put some numbers on this.

Code
central_tendency_bert <- data %>%
  group_by(bert_topic) %>%
  summarise(mean_length = mean(char_length),
         median_length = median(char_length),
         variance = var(char_length),
         std = sd(char_length))


central_tendency_bert %>%
  tidyr::gather(measure, result, -bert_topic) %>%
  ggplot(aes(x = result)) +
  geom_histogram() +
  facet_wrap(~ measure, scales = "free") +
  geom_histogram(data = subset(central_tendency_bert %>%
  tidyr::gather(measure, result, -bert_topic), measure == "variance"),
                 aes(x = result), binwidth = 1000) +
  labs(title = "Topics chosen using BertopicR")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

While the bulk of the data has a mean post length around 200 and a median between 100 and 200, there does seem to be a wide range of variance in the data, and some topics definitely look like they have outlier means and medians. There also seems to be one topic with a variance of 0. That wouldn’t make any sense, and that’s because it’s not 0 at all! The binwidth is automatically selected by the geom_histogram function, and the range of the data simply means that it picks a very large binwidth whose first bin encompasses 0.

Code
central_tendency_bert %>%
  arrange(variance) %>%
  head(5)
# A tibble: 5 × 5
  bert_topic mean_length median_length variance   std
       <dbl>       <dbl>         <dbl>    <dbl> <dbl>
1          8        77.2          68      1484.  38.5
2          3        84.4          67      3958.  62.9
3         35        72.4          52.5    5150.  71.8
4         10       103.           78      7848.  88.6
5          9       139.          106     12414. 111. 

The actual variance is 1484 and refers to topic 8. Maybe we should look at topic 8 a bit closer.

Code
data %>%
  filter(bert_topic == 8) %>%
  select(text_bert) %>% sample_n(10) %>%
  DT::datatable(options = list(pageLength = 5))

All looks ok: the topic seems to be about AI, maybe AI as referred to in a slightly negative or fearful way? It is clear that all of these posts are short and generally seem to be replies to something else. I wonder if it is grouping based on mentions?

Code
data %>%
  filter(bert_topic == 8, str_detect(text_bert, "^@")) %>% nrow()
[1] 3331
Code
data %>% filter(bert_topic == 8) %>% nrow()
[1] 3388

98% of these posts begin with a mention! If we removed mentions from the data before embedding, would this topic disappear? This is something I might look at more later.
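
As a sanity check, we can compute that proportion directly in one pipe:

Code
# proportion of topic 8 posts that start with a mention
data %>%
  filter(bert_topic == 8) %>%
  summarise(prop_mention = mean(str_detect(text_bert, "^@")))  # ~0.98 (3331 / 3388)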

But back to the stats… We’ll perform an ANOVA across topics to see if there is a statistical difference between groups.

Code
aov_df <- data %>%
  mutate(bert_topic = as.factor(bert_topic))

summary(aov(char_length ~ bert_topic, data = aov_df))
                Df    Sum Sq  Mean Sq F value Pr(>F)    
bert_topic      45 4.954e+08 11009032   354.3 <2e-16 ***
Residuals   248964 7.736e+09    31072                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Definitely significant…

What if we ignore the outlier category?

Code
aov_df <- data %>%
  filter(bert_topic != -1) %>%
  mutate(bert_topic = as.factor(bert_topic))

summary(aov(char_length ~ bert_topic, data = aov_df))
                Df    Sum Sq  Mean Sq F value Pr(>F)    
bert_topic      44 4.634e+08 10532595   295.5 <2e-16 ***
Residuals   109791 3.914e+09    35646                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Still significant, so the outlier topic isn’t causing this.

What does this mean? It is possible that messages are being pulled from forums where there are long threads (of long messages) all replying to and discussing the same thing, and thus belong in the same topic, and that shorter posts are being grouped together because they’re a bit more ambiguous and form a topic of “AI mentions we can’t classify” or something like that. It is also possible that the length of posts is affecting their embeddings and thus the clusters they are assigned to. Let’s see whether LDA gives similar results.

1.1 Modelling with SegmentR

For consistency, I am choosing the number of topics to be the same as the number of topics found with BertopicR.

Code
library(SegmentR)

dtms <- data %>%
  make_DTMs(text_var = text_bert,
            url_var = permalink,
            clean_text = TRUE,
            remove_stops = TRUE)

ldas <- dtms %>% SegmentR::fit_LDAs(k_opts = length(unique(data$bert_topic)) - 1)

explore <- ldas %>% SegmentR::explore_LDAs()
Code
probs <- explore %>% purrr::pluck("probabilities", 1)

linked_probabilities <- SegmentR::topics_link(data, probs)

topics_linked <- linked_probabilities %>%
  SegmentR::topics_classify(topic_cutoff =  0.1)

I could put a lot more thought into choosing a probability threshold for each post to be categorised into a particular topic, but for the simplicity of this analysis I’m going to set it to 0.1.
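
If we did want to be more principled, one option would be to look at the distribution of each post’s highest topic probability before fixing a cutoff. This is a hypothetical sketch: it assumes probs holds one row per document with topic_* probability columns, which may not match SegmentR’s actual output structure.

Code
# hypothetical: histogram of each document's maximum topic probability,
# assuming `probs` has one row per document and topic_* probability columns
probs %>%
  rowwise() %>%
  mutate(max_prob = max(c_across(starts_with("topic_")))) %>%
  ungroup() %>%
  ggplot(aes(x = max_prob)) +
  geom_histogram(binwidth = 0.05) +
  labs(x = "Maximum topic probability per document") +
  theme_minimal()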

Some posts won’t meet the threshold to be classified as a particular topic and I am going to align with the Bertopic convention and label these posts as -1.

Code
data_lda <- data %>%
  left_join(topics_linked) %>%
  rename(lda_topic = topic) %>%
  mutate(lda_topic = stringr::str_replace_all(lda_topic, "topic_", ""),
    lda_topic = ifelse(is.na(lda_topic), -1, lda_topic)) 
Joining with `by = join_by(text, clean_text, date, sender_screen_name,
conversation_id, universal_message_id, permalink, sentiment, message_type,
parent_universal_message_id, social_network, doc_id, text_bert, char_length,
bert_topic)`

Let’s look at how similar lda and bert topics are using sklearn’s adjusted rand index.

Code
sklearn <- reticulate::import("sklearn")
sklearn$metrics$adjusted_rand_score(data_lda$bert_topic, data_lda$lda_topic)
[1] 0.001816008

The adjusted rand score can fall from -1 to +1: +1 means the two clusterings are identical, 0 means the agreement is no better than random assignment, and scores below 0 mean less agreement than chance. A score of 0.002 indicates that our topics are not similar at all and that any similarity between topics is likely down to chance. We already knew that LDA and BERTopic would likely result in different topic groupings due to the difference in how they identify topics; this serves to further confirm it.
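
As a toy illustration of how the score behaves (these two examples are deterministic, not from our data):

Code
# ARI ignores label names: identical partitions score 1
x <- c(1, 1, 2, 2, 3, 3)
y <- c(2, 2, 3, 3, 1, 1)  # same grouping as x, different label names
sklearn$metrics$adjusted_rand_score(x, y)  # 1

# a partition that cuts across every group in x scores below 0,
# i.e. less agreement than expected by chance
z <- c(1, 2, 1, 2, 1, 2)
sklearn$metrics$adjusted_rand_score(x, z)  # about -0.36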

Now let’s look at whether post length is playing a role in topic assignment for the LDA topics.

Code
data_lda %>% 
  # order the topic factor levels numerically rather than sorting the values themselves
  mutate(lda_topic = factor(lda_topic, levels = sort(unique(as.numeric(lda_topic))))) %>%
ggplot(aes(x = lda_topic, y = char_length)) +
  geom_boxplot() +
  labs(x = "Topic", y = "Character Length") +
  theme_minimal()  

The means and interquartile ranges look a lot more similar here; there is, however, variation in the number of posts with a high character length in each group. Let’s look at some central tendency measures again to get a better idea of this.

Code
central_tendency_lda <- data_lda %>%
  group_by(lda_topic) %>%
  summarise(mean_length = mean(char_length),
         median_length = median(char_length),
         variance = var(char_length),
         std = sd(char_length))


central_tendency_lda %>%
  tidyr::gather(measure, result, -lda_topic) %>%
  ggplot(aes(x = result)) +
  geom_histogram() +
  facet_wrap(~ measure, scales = "free") +
  labs(title = "Topics chosen using SegmentR")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This does look less variable than the BertopicR results; all measures have a tighter distribution.

Code
aov_lda <- data_lda %>%
  filter(lda_topic != -1) %>%
  mutate(lda_topic = as.factor(lda_topic))

summary(aov(char_length ~ lda_topic, data = aov_lda))
                Df    Sum Sq  Mean Sq F value Pr(>F)    
lda_topic       44 8.392e+08 19073491   499.2 <2e-16 ***
Residuals   637545 2.436e+10    38208                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like there is a significant difference in post length between topics here too. Let’s have a look at how tokenising texts into sentences might work, and we can have more of a qualitative comparison after that.

1.2 Topic modelling with tokenised text

First we tokenise the data into sentences and then we’ll have a quick look at the post length distribution.

Code
data_tokenised <- data %>% 
  tidytext::unnest_tokens(output = sentences,
                          input = text_bert,
                          token = "sentences",
                          drop = FALSE,
                          to_lower = FALSE) %>%
  mutate(char_length_token = stringr::str_length(sentences))

data_tokenised %>%
  ggplot(aes(y = char_length_token)) +
  geom_boxplot()

There are still some longer sentences, likely due to a lack of punctuation in the original posts, but the majority of documents are now less than 250 characters, which is good: we have a relatively small range of sentence lengths. Let’s see if sentences fall into similar topics to their overall post.
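
To put a rough number on that claim (output omitted):

Code
# median, 90th percentile and share of sentence tokens under 250 characters
data_tokenised %>%
  summarise(median_length = median(char_length_token),
            p90 = quantile(char_length_token, 0.9),
            prop_under_250 = mean(char_length_token < 250))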

I am going to focus on BertopicR here and forget LDA for now, again loading in precalculated embeddings.

Code
reduced_token_embeddings <-
  readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/rds_tokenised_data_reduced_embeddings_rev2_V100gpu.rds")

Let’s make a new clusterer with a larger min_cluster_size, so that we generate a similar number of topics from the larger tokenised corpus (see the rough check below).
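
A rough check of the scaling (my reasoning rather than a formal rule): the tokenised corpus has over twice as many documents, so a cluster representing the same share of the corpus needs a proportionally larger minimum size. The exact value of 1900 looks tuned by trial, which is my assumption.

Code
# the tokenised corpus is ~2.3x the size of the post level one
nrow(data_tokenised) / nrow(data)  # ~2.29

# scaling the original min_cluster_size of 1000 by that factor would suggest
# a value in the low 2000s; 1900 lands us on a similar number of topics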

Code
clusterer_tokenised <- bt_make_clusterer_hdbscan(min_cluster_size = 1900L, cluster_selection_method = "leaf")

tokenised_model <- bt_compile_model(embedding_model = bt_empty_embedder(),
                                    reduction_model = bt_empty_reducer(),
                                    clustering_model = clusterer_tokenised)

bt_fit_model(tokenised_model, data_tokenised$sentences, reduced_token_embeddings)
Code
tokenised_model$get_topic_info() %>%
  select(-Representative_Docs) %>%
  DT::datatable(options = list(pageLength = 5))

45 topics here and 45 topics in the original analysis! We have no way to know to what extent these topics match up without doing a more qualitative analysis, which we’ll get to.

First let’s look at how this affects outliers:

Code
data_joined <- data_tokenised %>%
  rename(char_length_post = char_length) %>%
  mutate(tokenised_topics = tokenised_model$topics_) 

data_joined %>%
  filter(tokenised_topics == -1) %>% nrow()/nrow(data_joined)
[1] 0.6515969
Code
data %>%
  filter(bert_topic == -1) %>% nrow()/nrow(data)
[1] 0.5589093

So when we tokenise the data into sentences, ~65% of the sentences are classified as outliers, while ~56% of posts are classified as outliers when we perform topic modelling at the post level. It might make sense that there are more outliers when looking at individual sentences, as there are likely lots of filler sentences that don’t necessarily mean anything.

If we disregard outliers for now, how many documents have their tokenised sentences grouped into different topics?

Code
data_joined %>% 
  filter(tokenised_topics != -1) %>% 
  pull(doc_id) %>% unique() %>% length()
[1] 133469
Code
data_joined %>% 
  filter(tokenised_topics != -1) %>%
  group_by(doc_id) %>% 
  mutate(num_topics_per_doc = n_distinct(tokenised_topics)) %>% select(doc_id, tokenised_topics, bert_topic, num_topics_per_doc) %>% 
  filter(num_topics_per_doc > 1) %>% pull(doc_id) %>% unique() %>% length()
[1] 27068

27068 posts have their tokenised sentences assigned to more than 1 topic, so we could assume that 27068 posts contain more than 1 topic. This is about 20% of all posts (excluding outliers).

Before looking at how the sentence tokenised topics match up with the post level topics, let’s have a quick look at how sentences of different lengths are distributed across topics.

Code
data_joined %>% 
  mutate(tokenised_topics = as.factor(tokenised_topics)) %>%
ggplot(aes(x = tokenised_topics, y = char_length_token)) +
  geom_boxplot() +
  labs(x = "Topic", y = "Character Length") +
  theme_minimal() 

Again, there are some topics (3, 9, 19, 23…) that seem to have a much lower sentence length distribution than other topics.

Code
central_tendency_tokenised <- data_joined %>%
  group_by(tokenised_topics) %>%
  summarise(mean_length = mean(char_length_token),
         median_length = median(char_length_token),
         variance = var(char_length_token),
         std = sd(char_length_token))

central_tendency_tokenised %>%
  tidyr::gather(measure, result, -tokenised_topics) %>%
  ggplot(aes(x = result)) +
  geom_histogram() +
  facet_wrap(~ measure, scales = "free") +
  labs(title = "Topics chosen using BertopicR (sentence tokenised)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Code
summary(aov(tokenised_topics ~ char_length_token, data_joined))
                      Df   Sum Sq Mean Sq F value Pr(>F)
char_length_token      1        7    7.29   0.065  0.799
Residuals         569462 64361604  113.02               

Mean and median token length have a much lower range and a tighter distribution than when we looked at overall post length, though we can still see outliers at the lower end of the distribution. Unsurprisingly, tokenising into sentences seems to have removed the outliers at the upper end of the distribution. The ANOVA did not find a significant relationship, although note that the formula above regresses the topic number on character length, which is not the same specification as the earlier post level tests; the consistent specification is sketched below.
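
For comparison, the specification consistent with the earlier post level ANOVAs would treat character length as the response and topic as the grouping factor (not run here):

Code
# character length as response, topic as grouping factor, outliers excluded,
# matching the specification of the post level ANOVAs above
summary(aov(char_length_token ~ as.factor(tokenised_topics),
            data = data_joined %>% filter(tokenised_topics != -1)))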

It looks like tokenising into sentences might have removed, or at least reduced, the effect of post length on topic assignment. Let’s have a look at the actual results we get by tokenising into sentences: is it worth doing?

There are a few things we can look at here:

- Are posts that were in the outlier topic now getting sentences put into another topic?
- Are posts that were classified having their individual sentences put in the outlier category?
- Are sentences from certain topics tending to go to the same topic after tokenisation into sentences?

We can create a contingency table to see the topics tokenised sentences would be assigned to based on whether topic modelling is performed at a post or sentence token level.

Code
contingency_table <- table(data_joined$tokenised_topics, data_joined$bert_topic)

# contingency_table[1:8,]

contingency_table %>%
  as.data.frame() %>%
  rename(tokenised = Var1, post_level = Var2) %>%
  group_by(tokenised) %>% 
  mutate(prop = Freq/sum(Freq)) %>%
  ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Contingency Table Heatmap", x = "Sentence Tokenised Topic", y = "Post Level Topic")

In the heatmap above, the x axis corresponds to the sentence tokenised topic and the y axis to the topic the sentences would have been classified as in the post level topic modelling.

One big takeaway is that most outliers remain outliers.

Code
data_joined %>%
  filter(bert_topic == -1, tokenised_topics == -1) %>% distinct(doc_id) %>% nrow()
[1] 120860
Code
post_outlier_not_token_outlier <- data_joined %>%
  filter(bert_topic == -1, tokenised_topics != -1) %>% 
  distinct(doc_id) %>% pull(doc_id)

data_joined %>%
  filter(bert_topic == -1, tokenised_topics == -1) %>% 
  anti_join(data.frame(doc_id = post_outlier_not_token_outlier), by = "doc_id") %>%
  distinct(doc_id) %>% nrow()
[1] 81053

120860 of the outlier posts (~87%) have at least one tokenised sentence that remains an outlier after sentence token level topic modelling, while 81053 (~58%) of the outlier posts remain entirely outliers after sentence token level topic modelling.
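
For reference, both percentages are taken against the 139174 post level outlier posts:

Code
n_outlier_posts <- data %>% filter(bert_topic == -1) %>% nrow()  # 139174
120860 / n_outlier_posts  # ~0.87: at least one sentence remains an outlier
81053 / n_outlier_posts   # ~0.58: every sentence remains an outlier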

If we look beyond the outliers, a lot of the sentence token level topics seem to correspond to one or more post level topics, e.g. topic 1 at the sentence token level seems to correspond relatively well with topic 2 at the post level, and topic 3 at the sentence token level corresponds best with topic 8 at the post level.

Let’s consider sentence token level topic 2: the majority of sentences from this topic would have been classified at the post level as either outliers or as part of topic 3. Let’s look first at how posts in topic 3 correspond to sentences in token level topic 2.

Code
data_joined %>% 
  filter(tokenised_topics == 2, bert_topic == 3) %>%
  sample_n(10) %>%
  select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))

This all appears to be about AI and art. What about the sentences that would have been classified as an outlier at the post level?

Code
data_joined %>% 
  filter(tokenised_topics == 2, bert_topic == -1) %>%
  sample_n(10) %>%
  select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))

In some instances, particularly where the original post is the same as the sentence tokenised document, it is difficult to see why that post / sentence would have been classified as an outlier in the post level topic model but as part of topic 2 in the sentence token level topic model. In other cases, particularly with longer posts, it is clear that the relevant sentences from the overall post are classified in the relevant sentence token level topic, i.e. the art specific sentences from longer posts are classified as topic 2 at the sentence token level.

Why would some posts that are only 1 sentence long to begin with (i.e. the same as the sentence tokenised document) fall into the outlier topic at the post level but into a specific topic after sentence token level topic modelling? It could be that creating topics based on text tokenised into sentences creates more focused topics, without the “fluff” of irrelevant sentences in posts. This could be why some posts that didn’t fall into the “art” topic in the post level modelling may fall into the “art” topic in sentence token level topic modelling.

What if we look at some of the smaller topics?

Code
tail(contingency_table, 8) %>%
  as.data.frame() %>%
  rename(tokenised = Var1, post_level = Var2) %>%
  group_by(tokenised) %>% 
  mutate(prop = Freq/sum(Freq)) %>%
  ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Contingency Table Heatmap", x = "Sentence Tokenised Topic", y = "Post Level Topic")

Some of these sentence token level topics look to be mainly outliers in the post level topic model. Let’s look at topic 44.

Code
data_joined %>% 
  filter(tokenised_topics == 44) %>%
  sample_n(10) %>%
  select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))

This looks like an example of where tokenisation might result in a meaningless topic. This appears to be a topic of mainly irrelevant sentences from what otherwise could be a post with useful information.

Finally, let’s look at a post level topic that had a generally higher character length.

Code
central_tendency_bert %>% arrange(desc(median_length))
# A tibble: 46 × 5
   bert_topic mean_length median_length variance   std
        <dbl>       <dbl>         <dbl>    <dbl> <dbl>
 1          2        377.          264.  108491.  329.
 2         29        286.          235    60251.  245.
 3         41        256.          214    53384.  231.
 4         20        259.          209    60605.  246.
 5         14        282.          207    77311.  278.
 6         43        217.          204    29247.  171.
 7         31        201.          200.   13485.  116.
 8         21        232.          198    39700.  199.
 9         42        200.          196    18974.  138.
10          6        221.          194    36288.  190.
# ℹ 36 more rows

Topic 2 at the post level (not the same as the topic 2 we just looked at, which was at the sentence token level) has the highest median character length. Let’s look at how its sentences get classified after tokenisation.

Code
contingency_table[,4, drop = FALSE] %>%
  as.data.frame() %>%
  rename(tokenised = Var1, post_level = Var2) %>%
  # group_by(tokenised) %>% 
  mutate(prop = Freq/sum(Freq)) %>%
  ggplot(aes(x = tokenised, y = post_level, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Contingency Table Heatmap", x = "Sentence Tokenised Topic", y = "Post Level Topic")

The majority of sentences from this topic become outliers at the sentence token level and, while sentences are distributed across different topics, there is a clear connection to topic 1 at the sentence token level.

Code
data_joined %>% 
  filter(bert_topic == 2, tokenised_topics == 1) %>%
  sample_n(10) %>%
  select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))
Code
data_joined %>% 
  filter(bert_topic == 2, tokenised_topics == -1) %>%
  sample_n(10) %>%
  select(text_bert, sentences, doc_id) %>% DT::datatable(options = list(pageLength = 5))

This topic seems to be about gaming, and it would seem that some of the sentences that are classified as outliers at the token level could make sense in the actual gaming topic.

We can compare the two methods of topic classification using the adjusted rand index from the python sklearn package.

Code
bert_topic <- data_joined$bert_topic
tokenised_topic <- data_joined$tokenised_topics

sklearn$metrics$adjusted_rand_score(bert_topic, tokenised_topic)
[1] 0.142766

0.14 indicates that our clusters aren’t that well aligned across the two analyses; it is a numerical representation of the fact that the clusters we obtain by modelling data tokenised into sentences differ from the clusters we obtain by modelling raw posts.

1.3 Extra Analysis 1: What about if we do embeddings on lower case data?

During this analysis I accidentally embedded, and reduced the embeddings of, lowercase sentence tokenised data, and said I’d take this opportunity to look at what difference this might make. Let this be a warning: unnest_tokens lowercases tokenised text by default.
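
A minimal demonstration of that default behaviour:

Code
# to_lower defaults to TRUE, so case is silently dropped
tibble::tibble(text = "BertopicR and SegmentR. Two VERY different tools.") %>%
  tidytext::unnest_tokens(sentences, text, token = "sentences")
# returns "bertopicr and segmentr." etc.; pass to_lower = FALSE to keep case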

Code
data_lowercase_tokenised <- data %>% 
  tidytext::unnest_tokens(output = sentences,
                          input = text_bert,
                          token = "sentences",
                          drop = FALSE) %>%
  mutate(char_length_token = stringr::str_length(sentences))

lowercase_reduced_token_embeddings <-
  readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/tokenised_data_reduced_embeddings_posts_over_2000_chars_removed_V100gpu.rds")

To what extent do the embeddings change?

Code
reduced_token_embeddings[1:5]
# A tibble: 569,464 × 5
     `0`   `1`   `2`   `3`   `4`
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1  8.82  4.87  4.34  2.65  5.12
 2 11.4   7.61  4.57  2.31  4.19
 3  9.44  5.60  4.60  4.20  5.92
 4 10.3   7.78  3.59  2.12  4.06
 5 11.1   9.56  5.63  1.89  5.76
 6 10.7   5.80  3.20  4.19  6.07
 7 10.8   5.15  4.04  3.05  4.49
 8 11.3   7.92  4.36  2.71  4.25
 9 11.1   7.60  4.52  3.17  4.23
10 11.0   8.20  3.88  2.54  4.08
# ℹ 569,454 more rows
Code
lowercase_reduced_token_embeddings[1:5]
# A tibble: 569,464 × 5
     `0`   `1`   `2`   `3`   `4`
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1  6.61  5.58  4.96  4.18  4.53
 2  9.43  5.85  5.02  3.78  4.57
 3  7.01  4.50  4.07  4.55  3.62
 4  9.07  6.27  4.60  4.34  5.47
 5  9.41  6.83  4.65  2.71  2.68
 6  7.86  3.60  2.93  4.65  4.54
 7  7.97  4.67  4.64  4.10  5.25
 8  9.62  5.71  4.77  4.10  4.60
 9  9.39  5.32  4.77  4.59  4.66
10  9.64  5.97  4.52  4.43  4.97
# ℹ 569,454 more rows

Let’s use the same clusterer as we used for the original sentence tokenised data and see if it changes anything here.

Code
lowercase_tokenised_model <- bt_compile_model(embedding_model = bt_empty_embedder(),
                                    reduction_model = bt_empty_reducer(),
                                    clustering_model = clusterer_tokenised)

bt_fit_model(lowercase_tokenised_model, data_lowercase_tokenised$sentences, lowercase_reduced_token_embeddings)
Code
lowercase_tokenised_model$get_topic_info() %>%
  select(-Representative_Docs) %>%
  DT::datatable(options = list(pageLength = 5))

45 topics here and 45 topics in the original sentence tokenised analysis; this is a good start! We should look at how well these topics match up.

Code
data_lowercase_joined <- data_joined %>%
  mutate(tokenised_lowercase_topics = lowercase_tokenised_model$topics_) 

First let’s look at how the topics, and in particular the outliers, align:

Code
lowercase_contingency_table <- table(data_lowercase_joined$tokenised_topics, data_lowercase_joined$tokenised_lowercase_topics)

lowercase_contingency_table[1:8,] %>%
  as.data.frame() %>%
  rename(tokenised = Var1, lowercase = Var2) %>%
  group_by(tokenised) %>% 
  mutate(prop = Freq/sum(Freq)) %>%
  ggplot(aes(x = tokenised, y = lowercase, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Contingency Table Heatmap", x = "Standard Text, Sentence Tokenised Topic", y = "Lowercase Text, Sentence Tokenised Topic")

Here the x axis represents the original sentence tokenised topics and the y axis the lowercase topics. Looking at both the outlier row and column, it is clear that there is disagreement on outlier tokens in both directions. However, disregarding outliers, there is generally high agreement on which topics tokens belong to.

If we take the first few columns for the original sentence tokenised topic 6:

Code
lowercase_contingency_table[8, 1:12]
  -1    0    1    2    3    4    5    6    7    8    9   10 
2147   11    1    0    0    0    1    1    5    3 4958    1 

4958 (~69%) of this topic corresponds to topic 9 of the lowercase topics, while the other ~30% of topic 6 becomes outliers when topic modelling with the lowercase data. This is a trend that holds across the data. The alignment of topics across the two datasets does raise one question: why is the only consistent disagreement between the two datasets in relation to outlier classification? This really highlights how careful we should be when dealing with outliers in topic modelling.
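
Computing those proportions directly from the full row of the table:

Code
# share of original tokenised topic 6 (row 8) going to the outlier bin
# vs lowercase topic 9
round(prop.table(lowercase_contingency_table[8, ])[c("-1", "9")], 2)
# -1 ~0.30, 9 ~0.69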

Again we can look at the adjusted rand index.

Code
lowercase_tokenised_topic <- data_lowercase_joined$tokenised_lowercase_topics

sklearn$metrics$adjusted_rand_score(tokenised_topic, lowercase_tokenised_topic)
[1] 0.5775573

Here we get an adjusted rand score of 0.58. The outlier disagreement between the two clusterings is likely the main contributor to this score not being closer to 1.

1.4 Extra Analysis 2: What about if we remove mentions?

It was apparent after the post level topic modelling that at least one topic (topic 8) seemed to be grouped based on posts that began with a mention. We’ll now look at whether this is the case across other topics, and at what happens if we remove the mentions.

First let’s see if this is the case for any other topic. We’ll disregard outliers for this and arbitrarily set the threshold for the share of posts per topic beginning with a mention at 80%.

Code
data %>%
  group_by(bert_topic) %>%
  mutate(topic_count = n()) %>% 
  filter(str_detect(text_bert, "^@"), bert_topic != -1) %>% 
  mutate(mention_topic_count = n(),
         mention_prop = mention_topic_count/topic_count) %>% 
  filter(mention_prop > 0.8) %>% distinct(bert_topic)
# A tibble: 2 × 1
# Groups:   bert_topic [2]
  bert_topic
       <dbl>
1          3
2          8

We had already identified topic 8 in the above analysis. It looks like topic 3 could also be grouped based on mentions. If you remember, topic 3 was made up of abnormally short posts… Let’s look at topic 3 and see what it is about.

Code
data %>% 
  filter(bert_topic == 3) %>%
  sample_n(10) %>%
  select(text_bert) %>% DT::datatable(options = list(pageLength = 5))

This seems to be a topic based on people complimenting others. I wonder if it would still be a topic if mentions are removed?

Code
no_mention_reduced_embeddings <- readr::read_rds("~/Library/CloudStorage/GoogleDrive-aoife.ryan@sharecreative.com/.shortcut-targets-by-id/0BwEyzS8OvJgreXdPNGZKV2tyRjg/Share_Clients/data_science_project_work/Topic Modelling for posts of variable length/data/embeddings/rds_reduced_embeddings_no_mentions_V100gpu.rds")
Code
data_no_mentions <- data %>%
  mutate(text_no_mentions = text_bert,
         text_no_mentions = str_remove_all(text_no_mentions, "\\B@\\w+")) %>%
  LimpiaR::limpiar_spaces(text_no_mentions)

no_mentions_model <- bt_compile_model(embedding_model = bt_empty_embedder(), # skip embedding step
                                      reduction_model = bt_empty_reducer(), # skip reducing
                                      clustering_model = clusterer) # perform clustering again 

bt_fit_model(no_mentions_model, data_no_mentions$text_no_mentions, no_mention_reduced_embeddings)
Code
no_mentions_model$get_topic_info() %>% select(-Representative_Docs) %>% 
  DT::datatable(options = list(pageLength = 5))

52 topics this time instead of 45; removing mentions obviously changes things.

Let’s look at what happens to topics 3 and 8, which appeared to be grouped based on mentions.

Code
data_no_mentions <- data_no_mentions %>%
  mutate(no_mention_topic = no_mentions_model$topics_)

no_mention_contingency <- table(data_no_mentions$bert_topic,
                                data_no_mentions$no_mention_topic)

no_mention_contingency[c(5,10),] %>% t()
    
        3    8
  -1 3174 2521
  0    14   20
  1     7    2
  2     5    6
  3    16    7
  4     1    3
  5     1    1
  6     1   20
  7    33    5
  8     2    1
  9     2  336
  10    7    1
  11   18    5
  12    6    3
  13  262    4
  14   11   10
  15    0    3
  16    3    7
  17  131   14
  18    0    0
  19    6    0
  20    0    1
  21    1    5
  22    2    0
  23    2    1
  24    0  159
  25    0    5
  26   37    1
  27    1    2
  28    3    2
  29    0    2
  30    0    0
  31    0    3
  32  949    1
  33    3   33
  34    1    2
  35    4    4
  36    6    1
  37    1    1
  38    1    0
  39   30    0
  40    1    2
  41   12    4
  42    2    7
  43    2  142
  44    1   23
  45    0    0
  46    1    1
  47    1    4
  48   37    0
  49    5    6
  50    1    1
  51    2    6

They both become mainly outliers! Topic 3 does have 949 appearances in no mention topic 32; let’s look at that.

Code
data_no_mentions %>%
  filter(no_mention_topic == 32, bert_topic == 3) %>%
  sample_n(10) %>%
  select(text_bert, text_no_mentions) %>% DT::datatable(options = list(pageLength = 5))

This seems to be all people congratulating others and is probably a fair topic. It looks like removing mentions has resulted in only the posts that should be grouped together being grouped together.

What effect does this have on the rest of the data?

Code
no_mention_contingency[1:20,] %>% 
  as.data.frame() %>%
  rename(original = Var1, no_mentions = Var2) %>%
  group_by(original) %>% 
  mutate(prop = Freq/sum(Freq)) %>%
  ggplot(aes(x = original, y = no_mentions, fill = prop)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Contingency Table Heatmap", x = "Original Post Level Topic", y = "Mentions Removed Post Level Topic")

Some of the original topics seem to split across multiple topics in this analysis (e.g. topic 0 and topic 2), but for the most part the original topics map to only one topic here. It looks like, similar to when we performed the analysis in lowercase, the only place with high levels of disagreement is the outliers. Another indicator that outliers are hard and we need to be careful when messing with them!

Again we can look at the adjusted rand index.

Code
no_mentions_topic <- data_no_mentions$no_mention_topic
bert_topic <- data_no_mentions$bert_topic

sklearn$metrics$adjusted_rand_score(bert_topic, no_mentions_topic)
[1] 0.4031806

Here we get an adjusted rand score of 0.40. The outlier disagreement and the dissolution of some topics that appeared to be grouped based on mentions are likely the main contributors to this score not being closer to 1.

Finally, is post length still affecting topics? It looks like it is (bearing in mind that the aov below uses the same topic-number-as-response specification noted earlier).

Code
data_no_mentions %>% 
  mutate(char_length_no_mention = str_length(text_no_mentions),
         no_mention_topic = as.factor(no_mention_topic)) %>%
ggplot(aes(x = no_mention_topic, y = char_length_no_mention)) +
  geom_boxplot() +
  labs(x = "Topic", y = "Character Length") +
  theme_minimal()  

Code
central_tendency_no_mention <- data_no_mentions %>%
  mutate(char_length_no_mention = str_length(text_no_mentions)) %>%
  group_by(no_mention_topic) %>%
  summarise(mean_length = mean(char_length_no_mention),
         median_length = median(char_length_no_mention),
         variance = var(char_length_no_mention),
         std = sd(char_length_no_mention))

central_tendency_no_mention %>%
  tidyr::gather(measure, result, -no_mention_topic) %>%
  ggplot(aes(x = result)) +
  geom_histogram() +
  facet_wrap(~ measure, scales = "free") +
  labs(title = "Topics chosen using BertopicR (mentions removed)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Code
summary(aov(no_mention_topic ~ char_length_no_mention, data_no_mentions %>%
              mutate(char_length_no_mention = str_length(text_no_mentions))))
                           Df   Sum Sq Mean Sq F value Pr(>F)    
char_length_no_mention      1    95929   95929   485.6 <2e-16 ***
Residuals              249008 49195762     198                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.5 Conclusion

What effect is post length having on topic modelling? The fact that, across each analysis, the post length distribution varies so much from topic to topic suggests that it does play a role. That said, the topics themselves appear to be valid, and who is to say that they shouldn’t be grouped like this regardless of post length? When we looked at the contingency table after topic modelling at the sentence token level, sentences from a given post level topic generally mapped to the same sentence token level topic and to the outlier category. This would suggest that topic modelling at the post level isn’t actually a terrible generalisation. While we saw that about 20% of posts contained more than one topic, when we looked at individual topics, the consensus across analyses was pretty good.

Are posts that were in the outlier topic now getting sentences put into another topic? In many cases, yes: 42% (see above) of outlier posts have at least one tokenised sentence assigned to a specific topic after tokenisation into sentences. That said, we did see that not all of those topics were meaningful, e.g. topic 44 was made up of sentences that were mainly outliers at the post level and it did not provide any useful information.

Are posts that were classified as a particular topic having their individual sentences put in the outlier category? Yes! A quick glance at the first row of our contingency_table tells us that a large number of sentences from every post level topic are being classified as outliers after sentence token level topic modelling.

Are sentences from certain topics tending to go to the same topic after tokenisation into sentences? In a lot of cases, yes, e.g. we looked at how topic 1 at the sentence token level corresponds relatively well with topic 2 at the post level, and topic 3 at the sentence token level corresponds best with topic 8 at the post level.

Should we tokenise our data into sentences? The big drawback of tokenising data for the purpose of modelling on smaller text chunks is processing time: it massively increases the number of documents we need to embed (it more than doubled here). If we want to embed in R, this is not really feasible; if we are happy to embed using Colab and load the results into R, it only takes a few minutes and shouldn’t be a barrier. Tokenising data into sentences can really only improve our results by making them more granular, although one question is how we would deal with posts that fall into multiple topics in terms of presenting them to a client. Another big advantage of tokenising data into sentences is that when you only have to scan one sentence, it is a lot easier for us as users to interpret the data and understand what topics are about.

Are there different results when embedding lowercase vs standard case data? Lowercasing the data does not appear to affect the identification of clusters, but the two runs do disagree on what is classified as an outlier. There has been consistent disagreement on outliers throughout this analysis, and without a ground truth it is tough to say what best practice would be.

Should we remove mentions? The quick analysis above suggests that things are sometimes grouped based on mentions. This may not always be a bad thing, and sometimes it may make sense to form a topic around a mention. Removing mentions didn’t seem to alter our results too much, so if you feel it makes sense to remove them for your dataset, it should be ok to do so.

Other comments: there has been consistent disagreement on outliers across this analysis. Without a ground truth it is impossible to say definitively what should and should not be an outlier, so the biggest takeaway for me is to be conscious of this when performing topic modelling and when changing the outlier definition.