Comparing Topics & Subtopics at Scale

1 Background and Motivation

The development of BertopicR (based entirely on Maarten Grootendorst’s bertopic Python package) has highlighted the need for a new set of tools which can be used to understand topics and subtopics (or clusters and subclusters, herein used interchangeably) at scale.

First, we need to be able to identify what’s changed from one topic model to the next. We’d also like to be able to understand what each topic comprises in terms of subtopics. For both of these tasks a contingency table will go a long way; heatmaps & variation matrices are likely to be useful tools too.

When we’re comparing multiple topic models, we’ll likely want to lean on something like the Adjusted Rand Index (ARI) or Normalised Mutual Information (NMI), and create a matrix of ARI/NMI scores for our topic models.

Second, we need to be able to understand how established topics & subtopics relate to one another - i.e. which topics are similar (they share edges with similar subtopics) and which subtopics are similar (they share edges with similar topics, or have high cosine similarity to one another). Understanding how our subtopics and topics relate will help us to better understand what each is individually.

Third, we need better tools for exploring the topic modelling (and intermediate steps e.g. dimensionality reduction & clustering) outputs. The most likely candidate for this would seem to be a Shiny application, which takes a data frame with document, topic, subtopic, and reduced_embeddings columns as input.

Using the bertopic package has shown us the double-edged sword that is HDBSCAN clustering, which solves some of the issues associated with kMeans clustering, but at a considerable cost: we end up with too many topics for the time and attention we have to give them, and with a lot of noise.

Let’s get some data and begin our exploration.

2 Preparing Data for Exploration

We’ll get a data frame of documents, perform kMeans + HDBSCAN clustering on them, and get their topic representations from BertopicR.

library(BertopicR)
BertopicR:::import_bertopic() #import the bertopic Python module via reticulate
library(reticulate)

2.1 Required R Libraries

library(readr)
library(ggplot2)
library(dplyr)
library(stringr)
library(tidyr)
library(here)
library(plotly)
library(htmltools)

2.2 Loading Base Data

We’ll get a data frame of pre-calculated embeddings and remove the index.

embeddings <- read_csv("~/Google Drive/My Drive/data_science_project_work/microsoft/project_work/635_ai_landscape_q4/data/helper_data/635_embeddings.csv")

#Frame with the same number of rows as embeddings, containing text and id variables
data <- read_csv("~/Google Drive/My Drive/data_science_project_work/microsoft/project_work/635_ai_landscape_q4/data/helper_data/635_for_colab_mixed_data.csv")
  
#Drop the index which pandas saved (the csv was written without index = False)
embeddings <- embeddings %>%
  select(-1) %>%
  janitor::clean_names()

embeddings <- embeddings %>%
  mutate(id = data$universal_message_id,
         text = data$text,
         .before = 1)
embeddings %>% select(starts_with("x")) #inspect the embedding columns

2.3 Dimensionality Reduction

Then we’ll use UMAP to reduce our dimensions to 25 components, meaning our clustering algorithm has 25 features to extract clusters from.

reducer <- bt_make_reducer_umap(n_neighbors = 20L, n_components = 25L, verbose = TRUE)
reduced_dims <- embeddings %>% select(starts_with("x")) %>% bt_do_reducing(reducer = reducer, embeddings = .)
reduced_dims <- reduced_dims %>% as.data.frame()

2.4 Data Clustering with kMeans and HDBSCAN

We’ll get an output for both kMeans and HDBSCAN clusters; they have differing strengths & weaknesses, so combining them should be powerful.

kmeans <- bt_make_clusterer_kmeans()
kmeans_clusters <- bt_do_clustering(kmeans, reduced_dims) #clusterer first, as with bt_do_reducing
kmeans_clusters <- tibble::tibble(kmeans_clusters) #so we can pull $kmeans_clusters later

For reference: with min_cluster_size = 40L we had ~600 topics; with min_cluster_size = 100L, ~200; and with min_cluster_size = 200L, ~120 topics plus ~133k noise documents (less than with the other settings).

hdbscan <- bt_make_clusterer_hdbscan(min_cluster_size = 200L, min_samples = 30L, metric = "euclidean")
hdbscan_clusters <- bt_do_clustering(hdbscan, reduced_dims)
hdbscan_clusters <- tibble::tibble(hdbscan_clusters) %>% mutate(hdbscan_clusters = as.integer(hdbscan_clusters))

Mutate the frames together:

joined_data <- data %>%
  select(-topic) %>%
  mutate(kmeans_clusters = kmeans_clusters$kmeans_clusters,
         hdbscan_clusters = hdbscan_clusters$hdbscan_clusters,
         .before = permalink)

HDBSCAN outputs ‘noise’ with a -1 label, which can be used to remove documents that have not fit into any HDBSCAN cluster. We tend to find that there is a lot of noise, and that some documents are incorrectly assigned the noise label. In an ideal world, before filtering we would find the documents which have been incorrectly labelled as noise and re-assign them to an appropriate topic. However, it’s not trivial to reduce noise at scale without simultaneously placing documents which were correctly labelled as noise into clusters they don’t belong in. Navigating this trade-off is tricky, so for the purposes of this demonstration we will remove all noise in one fell swoop.

filtered_data <- joined_data %>%
  filter(hdbscan_clusters != -1)

Now that we have our data in place, let’s examine our topics!

3 Topic Contingency Table

If we take our assigned kMeans clusters and think of them as topics, and our assigned HDBSCAN clusters and think of them as subtopics, we can feed our lists of topics and subtopics to table() and create a contingency table. This table helps us understand how each subtopic is distributed among topics, and vice versa.

Create a contingency table of our kmeans and hdbscan clusters:

contingency_table <- table(filtered_data$kmeans_clusters, filtered_data$hdbscan_clusters)

# contingency_table <- table(filtered_data$hdbscan_clusters, filtered_data$kmeans_clusters) #Looks better for the variation matrix

The first input, kmeans_clusters, will place the topics as rows, and the second input, hdbscan_clusters, will place the subtopics as columns. We have 121 subtopics (0-120) and 10 topics (0-9), so we’ll have 121 columns and 10 rows:

dim(contingency_table)
## [1]  10 121

Summing the values of each column will give us the total volume for that HDBSCAN cluster.

Summing the rows will give us the total volume for each kMeans cluster.

Double check that the values correspond to what we think they should:

column_index <- 10 # chosen at random
filtered_data %>% count(hdbscan_clusters) %>% slice(column_index) %>% pull(n) ==
  colSums(contingency_table)[[column_index]]
## [1] TRUE
row_index <- 5 #chosen at random
filtered_data %>% count(kmeans_clusters) %>% slice(row_index) %>% pull(n) == rowSums(contingency_table)[[row_index]]
## [1] TRUE

3.1 Analysing Topic Distributions with Contingency Table

We’ll have a look at the table as a whole:

contingency_table # rows = kMeans clusters (0-9), columns = HDBSCAN clusters (0-120)
##    
##        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##   0    0    0    0    0  309    0    0    0 2582    0    0    0    0    0    0
##   1    0    0  297    0    0    0    0    0    0  996  244    0    0    0    0
##   2    0    0    0    0    0   22    0    0    0    0    0    0    0    0    0
##   3    0    0    0    0    0    0    0    0    0    0    0    0  163    0    0
##   4    0    0    0    0    0    0    0    0    0    0    0    0    0  435  728
##   5    0    0    0 5662   21 1702 2478    0    0    0    0    0    0    0    0
##   6 1076 4974    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0    0    0    0    0  479    1    0    0    0    0    0
##   8    0    0    0    0    0    0    0    0    0  123    0    0   65    0    0
##   9    0    0    0    0    0    0    0  266    0    4    0 1315   33    0    0
##    
##       15   16   17   18   19   20   21   22   23   24   25   26   27   28   29
##   0    0    0    0    0    0  324    0    1    0    0   13    0  376    0    0
##   1    0    0    0  241    0    0    0    0    0    0    0    0    0    0 1594
##   2    0    0    0    0    0    0    0  767    0    0    0    0    0  298    0
##   3  453  235    0    0    0    0    0    0    0  323    0  127    0    0    0
##   4    0    0  321    0    0    0    0    0    0    0 3352    2    0    0    0
##   5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0 7575    1  271    0  828    1    0    0    0    0    0
##   8    0    2   23    0    0    0    0    0    0    0    0  321    0    0    0
##   9    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0
##    
##       30   31   32   33   34   35   36   37   38   39   40   41   42   43   44
##   0  462    0    0  223  280    1   35    0  206    0    0    0    0    0    0
##   1    0  283 2880    0    0    0  458    0    0  399    0  260    0    0    0
##   2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   3    0    0    0    0    0    0    0  203    0    0    0    0    0    0    0
##   4    0    0    0    0    0  450    0    0    0    0   11    0    0    0    0
##   5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    1    0    0    0    0    0    0    0    0    8    0    0  134    0    0
##   8    0    0    0    0    0    0    0    0    0    0  381    0  409    4 1003
##   9    0    0    0    0    0    0    0    0    0    0    0    0    0  864    0
##    
##       45   46   47   48   49   50   51   52   53   54   55   56   57   58   59
##   0    0 2211  350  641  584  458  477    0  404  403  475  978  577  779  210
##   1    0    0   79    0    0    0    0    0    0    0    0    0    0    0    0
##   2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   3   78    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4    3    0    0    0    0    0    0    0   32    0    0    0    0    0    0
##   5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   8  776    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   9    0    0    0    0    0    0    0  917    0    0    0    0    0    0    0
##    
##       60   61   62   63   64   65   66   67   68   69   70   71   72   73   74
##   0    0    0  185    0    0    0    0    0    0    0    0    0    0    0    0
##   1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   2    0    0  546    0    0    0    0    0 3122  627    0  267    0    0    0
##   3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   4  104    0    0  578   49  203 1175  882    0    0 1409    0    0  967  728
##   5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   8  113    0    0    0  325    0    0    0    0    0    0    0  555    0    0
##   9    0  763    0    0    0    0    0    0    0    0    0    0    0    0    0
##    
##       75   76   77   78   79   80   81   82   83   84   85   86   87   88   89
##   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   1    0    0    0    0    0    0    0  109    0    0  530    0    0    0    0
##   2 4825 1987    0    0    0    0    0    0    0    0    0    0    0    0    0
##   3    0    0    0    0    0    0    0    0    0    0    0    0    0   76    0
##   4    0    0  960 1744  321    0    0    0    0    0    0    0    0    0    5
##   5    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0    0    0  417  290    0  253   86    0  810    0    0
##   8    0    0    0    0    3  529    0    0  203    0    0    0    0  126 1587
##   9    0    0    0    0    0    0    0    0    0    0    0  568    0    0    0
##    
##       90   91   92   93   94   95   96   97   98   99  100  101  102  103  104
##   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   1    0    0    0    1    0    0    0    0    0    0    0    0  209    0    0
##   2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   3  196    0    0    0    0    0 1994    0    0    0    0    0    0    0    0
##   4    0    0    0    0    0    0    0    0    0   95    0    0    0    0  887
##   5    0    0    0    0    0    0    0    0    0    0    0  415    0  521 1093
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0    0    0    0    0  241    0    0    0    0    0    0
##   8  208  961    0    0    0    0    0    0    0    0    0    0    0    0    0
##   9    0    0  206  494 2112  484    0  344    0  694 2802    0    3    0    0
##    
##      105  106  107  108  109  110  111  112  113  114  115  116  117  118  119
##   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   1    0    0    0    0    0    0    0    0   22    2    0    0  246 2235  293
##   2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   3    0    0    0  647  296  326 5953    0    0    0    0    0    0    0    0
##   4    0  318  477    0    0    0    0    0    0    0    0    0    0    0    0
##   5 2292    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   6    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##   7    0    0    0    0    0    0    0    0 3157    0    0    0    0    0    0
##   8    0    0 1126    0    0    0    0  769    8 2351  260  408    0    0    0
##   9    0    0    0    0    0    0    0    0    0 1315    0    0    0  207    0
##    
##      120
##   0    0
##   1  635
##   2    0
##   3    0
##   4    0
##   5    0
##   6    0
##   7    0
##   8    0
##   9    0

The first thing we notice is that the table is sparse: in any given column, most values are 0, meaning that none of the documents assigned to that subtopic (column) belong to that topic (row).
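
As a quick check, we can quantify this sparsity directly:

#Share of zero cells in the contingency table
mean(contingency_table == 0)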

The contingency table is most appropriate for looking up specific values. For example, if we want to see how subtopic 113 is distributed, we go to the 114th column (we started counting at 0), and we see that its vector of values is:

topic: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

count: [0, 22, 0, 0, 0, 0, 0, 3157, 8, 0]

So we can see that subtopic 113 has had the vast majority of its documents assigned to topic 7, with the leftovers assigned to topics 1 & 8, with 22 and 8 documents respectively.
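
Rather than reading the column off the full printout, we can index the table directly; the dimnames are character labels, so we quote the cluster id:

contingency_table[, "113"]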

3.2 Qualitative Comparison of Subtopics

Investigate whether the subtopics which fit into bigger topics are indeed similar:

filtered_data %>%
  filter(hdbscan_clusters %in% c(3, 5, 6),
         kmeans_clusters == 5)  %>%
  select(text, hdbscan_clusters) %>%
  DT::datatable(filter = "top")

  • Subtopic 3 appears to be about songs and music, with a focus on AI Covers or AI versions thereof.
  • Subtopic 5 appears to be mainly about AI-generated videos.
  • Subtopic 6 appears to be mainly about AI voiceover artists.

Each of these subtopics fits fairly cleanly into topic 5. We might want to infer at this stage that topic 5 has to do with AI-generated content or entertainment.

Let’s check whether subtopics 0 & 1, which fit into topic 6 rather than 5, seem different to 3, 5, 6:

filtered_data %>%
  filter(hdbscan_clusters %in% c(0, 1))  %>%
  select(clean_text, hdbscan_clusters) %>%
  DT::datatable(filter = "top")

Subtopic 0 is about Snap AI, whereas subtopic 1 appears to have been clustered on its leading digits - which are representative of a specific website’s username values. This shows the importance of properly cleaning data before topic modelling.

For example:

str_remove_all(c("122453 remove these. hello this is a demo. Keep these numbers 10282"), pattern = "^\\d+")
## [1] " remove these. hello this is a demo. Keep these numbers 10282"

Let’s try a different couple of subtopics; we’ll go with 29 and 32, which both have all their posts in topic 1:

filtered_data %>%
  filter(hdbscan_clusters %in% c(29, 32))  %>%
  select(text, hdbscan_clusters) %>%
  DT::datatable(filter = "top")

Subtopic 29 is about religion, 32 is about healthcare. They are not that similar at a glance, but they are certainly different from the AI-generated content. It looks like the subtopics are doing their job!

3.2.1 Bigram Network for Topic Understanding

We said earlier: 'We might want to infer at this stage that topic 5 has to do with AI-generated content or entertainment.' Let’s check with a bigram network:

filtered_data %>%
  filter(kmeans_clusters == 5) %>%
  JPackage::make_bigram_viz(clean_text)

The bigram network contains many references to things such as sound, video, music, art, movies etc.; this suggests that our inference may have been correct.

3.2.2 Weighted Log-odds for Within Topic Comparison

We have identified some subtopics which tend to fit nicely into topic 5; we can use weighted log-odds to understand their differences:

filtered_data %>%
  filter(kmeans_clusters == 5) %>%
  ParseR::calculate_wlos(topic_var = hdbscan_clusters, text_var = clean_text, top_terms_cutoff = 2500, nrow = 7) %>%
  purrr::pluck("viz")

We can see quite clearly that our subtopics are fairly similar, but importantly different. This is a satisfying result.

3.3 Distributions of Values

The contingency table works great for looking up individual data points with precision, but perhaps we can visualise the distributions in a better way.

subtopic_in_topic <- filtered_data %>%
  count(hdbscan_clusters, kmeans_clusters, name = "hdb_subtotal") %>%
  add_count(hdbscan_clusters, wt = hdb_subtotal, name = "hdb_total") %>%
  mutate(hdb_prop = hdb_subtotal/hdb_total)

density <- subtopic_in_topic %>%
  ggplot(aes(x = hdb_prop)) +
  geom_density(alpha = 0.8, fill = "midnightblue", color = "white") +
  theme_minimal() +
  labs(x = "Proportion of HDBSCAN cluster in kMeans clusters")

histogram <- subtopic_in_topic %>%
  ggplot(aes(x = hdb_prop)) +
  geom_histogram(color = "white", fill = "midnightblue") +
  theme_minimal() +
  labs(x = "Proportion of HDBSCAN cluster in kMeans clusters")

Combine the density and the histogram:

library(patchwork)
density + 
  histogram + 
  plot_layout(ncol = 1) + 
  patchwork::plot_annotation(title = stringr::str_wrap("Without slicing for the maximum value of proportion per HDBSCAN cluster, the distribution is extremely dense at 1.0, i.e. all of the HDBSCAN cluster fits into one kMeans cluster"))

3.3.1 Highest Value Only

Slice the highest value per subtopic, as this tells us the maximum proportion of that subtopic which belongs to any single topic; arguably a better way to cut the data up to figure out how good the fit is.

subtopic_in_topic %>%
  slice_max(n = 1, order_by = hdb_prop, by = hdbscan_clusters) %>%
  summarise(mean = mean(hdb_prop), median = median(hdb_prop), max = max(hdb_prop), min = min(hdb_prop))

At least 50% of our subtopics (median = 1) fit entirely into just one topic. The lowest max value of all is 0.515, or 51.5%.

We can check out the histogram and the density for the sliced values, just like we did for the unsliced ones:

subtopic_in_topic %>%
  slice_max(order_by = hdb_prop, n = 1, by = hdbscan_clusters) %>%
  ggplot(aes(x = hdb_prop)) +
  geom_histogram(color = "white", fill = "midnightblue") +
  theme_minimal() +
  labs(x = "Proportion of max(HDBSCAN cluster in kMeans cluster)") +
  subtopic_in_topic %>%
  slice_max(order_by = hdb_prop, n = 1, by = hdbscan_clusters) %>%
  ggplot(aes(x = hdb_prop)) +
  geom_density(alpha = 0.8, fill = "midnightblue", color = "white") +
  theme_minimal() +
  labs(x = "Proportion of max(HDBSCAN cluster in kMeans cluster)") +
  plot_layout(ncol = 1) +
  plot_annotation(title = stringr::str_wrap("When we slice the maximum value of proportion, such that we have one value per HDBSCAN cluster, our distribution is (predictably) further skewed towards proportion == 1.0, i.e. the whole HDBSCAN cluster fits inside one kMeans cluster."))

What’s the insight?

In our sample, virtually all subtopics fit neatly into one topic. Furthermore, there is no subtopic which does not have > 50% of its documents inside a single topic.

3.4 Interpreting Fit with a Variation Matrix

If we want to get a general understanding of how our subtopics fit into topics, a heatmap or variation matrix may be more useful than a contingency table, and will complement the distributions we just looked at.

Each cell of the variation matrix will be coloured according to the % of that column’s subtopic which belongs to that row’s topic.

Given that we saw a high number of 0s in our contingency table, and that the density was highest at 1, we would expect to see both a lot of white (0 values) and a lot of yellow (close to 100%).

plot <- filtered_data %>%
  LandscapeR::ls_plot_variation_matrix(hdbscan_clusters, kmeans_clusters) +
  theme(axis.text.x = element_text(angle = 90))

plot %>%
  plotly::ggplotly()
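
If LandscapeR isn’t to hand, a minimal geom_tile() sketch of the same idea can be built from the subtopic_in_topic frame we computed in section 3.3 (subtopic-topic combinations with no documents are absent from that frame, so they render as blank cells):

subtopic_in_topic %>%
  ggplot(aes(x = factor(hdbscan_clusters), y = factor(kmeans_clusters), fill = hdb_prop)) +
  geom_tile() +
  scale_fill_viridis_c(labels = scales::percent) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "HDBSCAN cluster (subtopic)", y = "kMeans cluster (topic)", fill = "% of subtopic")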

4 Understanding Relationships Between Subtopics & Topics

To better understand the intricate relationships between our topics and subtopics, we’ll visualize them using network plots.

The network plots are currently a work in progress.

We can transform our contingency table into a data frame, and our data frame into a graph:

library(tidygraph)
library(ggraph)
library(ggrepel)

df <- contingency_table %>% 
  as.data.frame() %>%
  tibble() %>%
  filter(Freq != 0) %>%
  # rename(from = Var2, to = Var1, weight = Freq) %>%
  rename(from = Var1, to = Var2, weight = Freq) %>%
  mutate(from = paste0("topic_", from)) %>%
  mutate(from = str_wrap(from, 10))

# Convert df to tidygraph object
(graph <- as_tbl_graph(df))
## # A tbl_graph: 131 nodes and 168 edges
## #
## # A directed acyclic simple graph with 2 components
## #
## # A tibble: 131 × 1
##   name   
##   <chr>  
## 1 topic_6
## 2 topic_1
## 3 topic_5
## 4 topic_0
## 5 topic_2
## 6 topic_9
## # ℹ 125 more rows
## #
## # A tibble: 168 × 3
##    from    to weight
##   <int> <int>  <int>
## 1     1    11   1076
## 2     1    12   4974
## 3     2    13    297
## # ℹ 165 more rows

4.1 Interactive Network

Rendering an interactive network will allow us to explore how topics are connected (they share edges with subtopics), and which subtopics are similar with respect to how they are distributed into topics (they share edges with the same topics).

Network visualisations are the GOATs when it comes to unearthing direct and indirect relationships between things.

Prepare the graph for visNetwork by adding some attributes in columns which visNetwork will automatically apply to our visualisation.

interactive_graph <- graph %>% 
  activate(nodes) %>% 
  mutate(
    origin = if_else(name %in% df$from, 'from', 'to'),
    color = ifelse(origin == "from", "black", "red"),
    size = ifelse(origin == "from", 10, 5),
    shape = ifelse(origin == "from", "circle", "star"),
    font.color = ifelse(origin == "from", "white", "black"),
    font.size = ifelse(origin == "from", 40, 20),
    label = name
  )


#Get edges and nodes for interactive viz
nodes_data <- as_tibble(interactive_graph %>% activate(nodes)) %>%
  mutate(id = row_number(), .before = 1) 

edges_data <- as_tibble(interactive_graph %>% activate(edges))

library(visNetwork)

#Customise nodes for better label visibility
nodes_data$font.size <- 20 #overrides the origin-based font sizes set above

nodes_data$font.face <- "bold"
nodes_data$shadow.enabled <- TRUE
nodes_data$shadow.size <- 10
nodes_data$shadow.x <- 2
nodes_data$shadow.y <- 2

Now we can visualise using visNetwork. We’ll set a randomSeed so the layout is similar each time we run it.

visNetwork(nodes = nodes_data, edges = edges_data) %>%
  visIgraphLayout(
    randomSeed = 42,
    layout = "layout_nicely",
    physics = TRUE,
    type = "square" #,
    # smooth = TRUE #saves time if not calling with this.
  )  %>%
  visOptions(
    highlightNearest = 
      list(enabled = TRUE, degree = 2, hover = TRUE),
    selectedBy = "origin",
    nodesIdSelection = TRUE) %>%
  # visPhysics(solver = "barnesHut")  %>%
  visPhysics(solver = "forceAtlas2Based") 

5 Insights and Challenges

Generally, HDBSCAN clusters fit tidily into kMeans clusters. This should mean that we’re able to use each of them for different analyses, according to the demands of a given research project. It should also mean that they are complementary when used together.

At first glance, the subtopics which fit into topics appear to make sense - or at least they seem to be interpretable.

The contingency table gets difficult to work with when the number of HDBSCAN clusters gets high (50-100+) - simply visualising that many rows or columns is tricky.

The variation matrix can come in handy when the contingency table gets too big, but it’s not made for analysing specific points. It will also become difficult to work with as the number of columns gets much higher. However, it is useful as an indication of how well the HDBSCAN clusters fit into the kMeans clusters (lots of yellow = they fit well).

The network plots help to identify relationships (direct and indirect); the challenge is how best to use them.

6 Automating Analysis with Functions

If we think the exploration has been useful, we may want to wrap everything up into functions to automate the analysis:

6.1 Summary statistics function

subtopic_summary_max <- function(df, subtopic_var, topic_var) {
  subtopic_sym <- rlang::ensym(subtopic_var)
  topic_sym <- rlang::ensym(topic_var)
  
  #Count + proportion
  data <- df %>%
    dplyr::count(!!subtopic_sym, !!topic_sym, name = "subtopic_subtotal") %>%
    dplyr::filter(subtopic_subtotal > 0) %>%
    dplyr::add_count(!!subtopic_sym, wt = subtopic_subtotal, name = "subtopic_total") %>%
    dplyr::mutate(proportion = subtopic_subtotal/subtopic_total) 
  
  message("Summary statistics counting the highest value for [proportion of subtopic in single topic]:")
  
  #Slice & Summarise (within a subtopic the total is fixed, so the max subtotal is also the max proportion)
  data <- data %>%
    dplyr::slice_max(n = 1, order_by = subtopic_subtotal, with_ties = FALSE, by = !!subtopic_sym) %>%
    dplyr::summarise(mean = mean(proportion), median = median(proportion), min = min(proportion), max = max(proportion))
  
  return(data)
  
}
filtered_data %>%
  subtopic_summary_max(hdbscan_clusters, kmeans_clusters)

6.2 Adjusted Rand Index (?)

Blank for now; a minimal sketch of where we might start is below.

library(aricode)
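
As a placeholder, here is a minimal sketch, assuming we simply want to score the agreement between our kMeans and HDBSCAN labellings (to compare separate topic models, we would feed in their respective topic vectors instead):

#Agreement between the two clusterings of the same documents
aricode::ARI(filtered_data$kmeans_clusters, filtered_data$hdbscan_clusters)
aricode::NMI(filtered_data$kmeans_clusters, filtered_data$hdbscan_clusters)

#With several models, a matrix of pairwise ARI scores; model_labels is a
#hypothetical named list of label vectors, one per topic model
model_labels <- list(kmeans = filtered_data$kmeans_clusters,
                     hdbscan = filtered_data$hdbscan_clusters)
ari_matrix <- outer(seq_along(model_labels), seq_along(model_labels),
                    Vectorize(function(i, j) aricode::ARI(model_labels[[i]], model_labels[[j]])))
dimnames(ari_matrix) <- list(names(model_labels), names(model_labels))
ari_matrix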

6.3 Interactive Topic Network Function

6.3.1 Calculate Subtopic - Topic Distribution

calculate_subtopic_in_topic <- function(df, subtopic_var, topic_var) {
  
  subtopic_sym <- rlang::ensym(subtopic_var)
  topic_sym <- rlang::ensym(topic_var)
  
  #Subset df for the two columns we need
  data <- df %>% dplyr::select(!!subtopic_sym, !!topic_sym)
  
  #For later graphing
  colnames(data) <- c(".subtopic", ".topic")
  
  data <- data %>% dplyr::count(.subtopic, .topic, name = "weight") %>%
    dplyr::filter(weight != 0) %>%
    dplyr::mutate(.topic = paste0("Topic ", .topic))
  
  return(data)
}

(dator <- calculate_subtopic_in_topic(filtered_data, hdbscan_clusters,  kmeans_clusters))

6.3.2 Make Graph (from the output of the calculation function)

make_subtopic_topic_graph <- function(output_of_calculate_subtopic_in_topic_function, subtopic_colour = "red", topic_colour = "black") {
  data <- output_of_calculate_subtopic_in_topic_function
  
  graph <- tidygraph::as_tbl_graph(data, directed = FALSE)
  
  graph <- graph %>%
    tidygraph::activate(nodes) %>%
    dplyr::mutate(
      origin = dplyr::if_else(name %in% data$.subtopic, 'subtopic', 'topic'),
      color = ifelse(origin == "subtopic", subtopic_colour, topic_colour),
      size = ifelse(origin == "subtopic", 5, 10),
      font.color = ifelse(origin == "subtopic","black",  "white"),
      font.size = ifelse(origin == "subtopic", 20, 40)
    )
  
  return(graph)
  
}
(my_graph <- make_subtopic_topic_graph(dator, subtopic_colour = "green", topic_colour = "midnightblue"))
## # A tbl_graph: 131 nodes and 168 edges
## #
## # An undirected simple graph with 2 components
## #
## # A tibble: 131 × 6
##   name  origin   color  size font.color font.size
##   <chr> <chr>    <chr> <dbl> <chr>          <dbl>
## 1 0     subtopic green     5 black             20
## 2 1     subtopic green     5 black             20
## 3 2     subtopic green     5 black             20
## 4 3     subtopic green     5 black             20
## 5 4     subtopic green     5 black             20
## 6 5     subtopic green     5 black             20
## # ℹ 125 more rows
## #
## # A tibble: 168 × 3
##    from    to weight
##   <int> <int>  <int>
## 1     1   122   1076
## 2     2   122   4974
## 3     3   123    297
## # ℹ 165 more rows

6.3.3 Visualise Graph (with output of the graph maker)

viz_subtopic_topic_interactive <- function(graph, layout = "layout_nicely", hover_depth = 1) {

  edges_data <- graph %>%
    tidygraph::activate(edges) %>%
    as.data.frame() %>%
    dplyr::mutate(color = "grey30")
  
  nodes_data <- graph %>%
    tidygraph::activate(nodes) %>%
    dplyr::mutate(id = dplyr::row_number(), .before = 1) %>%
    dplyr::mutate(shape = 
                    ifelse(origin == "subtopic", "star", "circle"),
                  label = name) %>%
    as.data.frame()
  
  #Customise nodes for better label visibility
  nodes_data$font.size <- 20
  nodes_data$font.face <- "bold"
  nodes_data$shadow.enabled <- TRUE
  nodes_data$shadow.size <- 10
  nodes_data$shadow.x <- 2
  nodes_data$shadow.y <- 2
  
  
  network <- visNetwork::visNetwork(
    nodes = nodes_data, 
    edges = edges_data) %>%
    visNetwork::visIgraphLayout(
      # randomSeed = 42,
      physics = TRUE,
      layout = layout,
      type = "square") %>%
    visNetwork::visOptions(
      highlightNearest = 
        list(
          enabled = TRUE, 
          degree = hover_depth, 
          hover = TRUE),
      selectedBy = "origin",
      nodesIdSelection = TRUE) %>%
    visNetwork::visPhysics(solver = "forceAtlas2Based") 
  
  return(network)
}
viz_subtopic_topic_interactive(my_graph, hover_depth = 1)

6.3.4 Quick Function

quick_interactive_viz <- function(df, subtopic_var, topic_var, layout = "layout_nicely", hover_depth = 1, topic_colour = "black", subtopic_colour = "red") {
  
  subtopic_sym <- rlang::ensym(subtopic_var)
  topic_sym <- rlang::ensym(topic_var)
  
  calcs <- calculate_subtopic_in_topic(df, !!subtopic_sym, !!topic_sym)
  graph <- make_subtopic_topic_graph(calcs, topic_colour = topic_colour, subtopic_colour = subtopic_colour)
  network <- viz_subtopic_topic_interactive(graph, layout = layout, hover_depth = hover_depth)
  
  return(network)
}

quick_interactive_viz(filtered_data, hdbscan_clusters,  kmeans_clusters, topic_colour = "green", subtopic_colour = "midnightblue")

7 Further work

  • A Shiny application for interactive exploration of topics and their documents (?)
  • Adjusted Rand Index for comparing multiple topic models
  • Tools for merging subtopics into topics (see the sketch below)
  • Better summary statistics to understand the fitting process at a glance, better visualisations too
  • Experimentation with more HDBSCAN and kMeans parameters (probably aided by the Shiny app)
  • Workflow recommendation & development (integration with other tools)
  • End-user outputs rather than researcher-facing ones
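
On the subtopic-merging point, a hypothetical starting point (sketched under the assumption that the dominant topic is a reasonable merge target) could reuse subtopic_in_topic from section 3.3 and assign each HDBSCAN cluster to the kMeans cluster holding the largest share of its documents:

#Hypothetical merge rule: each subtopic goes to its dominant topic
subtopic_to_topic <- subtopic_in_topic %>%
  slice_max(order_by = hdb_prop, n = 1, by = hdbscan_clusters, with_ties = FALSE) %>%
  select(hdbscan_clusters, merged_topic = kmeans_clusters, merged_prop = hdb_prop)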