Comparing Topics & Subtopics at Scale
1 Background and Motivation
The development of BertopicR (based entirely on Maarten Grootendorst’s bertopic Python package) has highlighted the need for a new set of tools which can be used to understand topics and subtopics (or clusters and subclusters, herein used interchangeably) at scale.
First, we need to be able to identify what’s changed from one topic model to the next. We’d also like to be able to understand what each topic comprises in terms of subtopics. For both of these tasks a contingency table will go a long way; heatmaps & variation matrices are likely to be useful tools too.
When we’re comparing multiple topic models, we’ll likely want to lean on something like the Adjusted Rand Index (ARI), or Normalised Mutual Information (NMI) and create a matrix of ARI/NMI scores for our topic models.
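As a quick illustration of that comparison matrix, here is a minimal sketch assuming the aricode package (ARI()/NMI()) and a named list of cluster-assignment vectors, one per topic model, all indexed over the same documents; the model names and assignments below are purely hypothetical:
library(aricode)
models <- list(
  model_a = sample(0:9, 1000, replace = TRUE), #hypothetical assignments
  model_b = sample(0:9, 1000, replace = TRUE),
  model_c = sample(0:9, 1000, replace = TRUE)
)
#Pairwise ARI scores; swap aricode::ARI for aricode::NMI to get an NMI matrix
ari_matrix <- outer(
  names(models), names(models),
  Vectorize(function(i, j) aricode::ARI(models[[i]], models[[j]]))
)
dimnames(ari_matrix) <- list(names(models), names(models))
ari_matrix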
Second, we need to be able to understand how established topics & subtopics relate to one another - i.e. which topics are similar (they have edges with similar subtopics) and which subtopics are similar (they share edges with similar topics, or have similar cosine similarity to one another). Understanding how our subtopics and topics relate will help us to better understand what they are individually.
Third, we need better tools for exploring the topic modelling outputs (and intermediate steps, e.g. dim reduction & clustering). The most likely candidate for this would seem to be a Shiny application, which takes a data frame of document, topic, subtopic, and reduced_embeddings as inputs.
Let’s get some data and begin our exploration
2 Preparing Data for Exploration
We’ll get a data frame of documents, perform kMeans + HDBScan clustering on them, and get their topic representations from BertopicR.
2.1 Required R libraries:
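Based on the functions called throughout this document, a likely set of libraries is (the original chunk isn’t shown, so treat this as an assumption):
library(tidyverse) #readr, dplyr, tidyr, ggplot2, stringr, purrr
library(janitor)   #clean_names()
library(BertopicR) #bt_make_clusterer_*(), bt_do_clustering(), ...
library(DT)        #interactive tables of documents
library(ParseR)    #calculate_wlos()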
2.2 Loading Base Data
We’ll get a data frame of pre-calculated embeddings and remove the index.
embeddings <- read_csv("~/Google Drive/My Drive/data_science_project_work/microsoft/project_work/635_ai_landscape_q4/data/helper_data/635_embeddings.csv")
#Frame with same # rows as embeddings with text and id variables
data <- read_csv("~/Google Drive/My Drive/data_science_project_work/microsoft/project_work/635_ai_landscape_q4/data/helper_data/635_for_colab_mixed_data.csv")
#Drop the index which was saved by Pandas (didn't add index = 0/False)
embeddings <- embeddings %>%
select(-1) %>%
janitor::clean_names()
embeddings <- embeddings %>%
mutate(id = data$universal_message_id,
text = data$text,
.before = 1)
embeddings %>% select(starts_with("x"))
2.3 Dimensionality Reduction
Then we’ll use UMAP to reduce our dimensions to 25 components, meaning our clustering algorithm has 25 features to extract clusters from.
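The reduction call itself is a one-liner; a minimal sketch, assuming BertopicR’s UMAP helpers bt_make_reducer_umap() and bt_do_reducing() (helper and argument names assumed, check your installed version):
#Reduce the embedding columns to 25 UMAP components for clustering
reducer <- bt_make_reducer_umap(n_components = 25L)
reduced_dims <- bt_do_reducing(reducer, embeddings %>% select(starts_with("x")))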
2.4 Data Clustering with kMeans and HDBSCan
We’ll get an output for both kMeans and HDBScan clusters; as they have differing strengths & weaknesses, combining them should be powerful.
kmeans <- bt_make_clusterer_kmeans()
kmeans_clusters <- bt_do_clustering(reduced_dims %>% as.data.frame(), kmeans)
With min_cluster_size = 40L we had ~600 topics; with min_cluster_size = 100L we had ~200; and with min_cluster_size = 200L we had ~120 plus 133k noise documents (less than with the other settings).
hdbscan <- bt_make_clusterer_hdbscan(min_cluster_size = 200L, min_samples = 30L, metric = "euclidean")
hdbscan_clusters <- bt_do_clustering(hdbscan, reduced_dims)
hdbscan_clusters <- tibble::tibble(hdbscan_clusters) %>% mutate(hdbscan_clusters = as.integer(hdbscan_clusters))
Mutate the frames together:
joined_data <- data %>%
select(- topic) %>%
mutate(kmeans_clusters = kmeans_clusters$kmeans_clusters,
hdbscan_clusters = hdbscan_clusters$hdbscan_clusters,
.before = permalink)
HDBScan outputs ‘noise’ as a -1 label which can be used to remove documents which have not fit into an HDBScan cluster. We tend to find that there is a lot of noise, and that some documents are incorrectly assigned the noise label. In an ideal world, before we filtered we would find the documents which have been incorrectly labelled as noise and re-assign them to an appropriate topic. However, it’s not trivial to reduce noise at scale without simultaneously placing documents which were correctly labelled as noise into clusters they don’t belong in. Navigating this trade-off is tricky, so for the purposes of this demonstration we will remove all noise in one fell swoop.
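The filtering itself is a simple dplyr step; a minimal sketch (filtered_data is the name used in the rest of this document):
#Remove documents which HDBScan labelled as noise (-1)
filtered_data <- joined_data %>%
  filter(hdbscan_clusters != -1)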
Now that we have our data in place, let’s examine our topics!
3 Topic Contingency Table
If we take our assigned kMeans clusters and think of them as topics,
and our assigned HDBScan clusters and think of them as subtopics, we can
feed in our list of topics and subtopics to table() and
create a contingency table. This table helps us understand how each
subtopic is distributed among topics, and vice-versa.
Create a contingency table of our kmeans and hdbscan clusters:
contingency_table <- table(filtered_data$kmeans_clusters, filtered_data$hdbscan_clusters)
# contingency_table <- table(filtered_data$hdbscan_clusters, filtered_data$kmeans_clusters) #Looks better for the variation matrix
The first input, kmeans_clusters, will place the topics as rows, and the second input, hdbscan_clusters, will place the subtopics as columns. We have 121 subtopics (0-120), and 10 topics (0-9), so we’ll have 121 columns and 10 rows:
## [1] 10 121
Summing the values of each column will give us the total volume for that HDBScan cluster.
Summing the rows will give us the total volume for each kMeans cluster.
Double check that the values correspond to what we think they should:
column_index <- 10 # chosen at random
filtered_data %>% count(hdbscan_clusters) %>% slice(column_index) %>% pull(n) ==
colSums(contingency_table)[[column_index]]
## [1] TRUE
row_index <- 5 #chosen at random
filtered_data %>% count(kmeans_clusters) %>% slice(row_index) %>% pull(n) == rowSums(contingency_table)[[row_index]]
## [1] TRUE
3.1 Analysing Topic Distributions with Contingency Table
We’ll have a look at the table as a whole:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 0 0 0 0 0 309 0 0 0 2582 0 0 0 0 0 0
## 1 0 0 297 0 0 0 0 0 0 996 244 0 0 0 0
## 2 0 0 0 0 0 22 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 163 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 435 728
## 5 0 0 0 5662 21 1702 2478 0 0 0 0 0 0 0 0
## 6 1076 4974 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 479 1 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 123 0 0 65 0 0
## 9 0 0 0 0 0 0 0 266 0 4 0 1315 33 0 0
##
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 0 0 0 0 0 0 324 0 1 0 0 13 0 376 0 0
## 1 0 0 0 241 0 0 0 0 0 0 0 0 0 0 1594
## 2 0 0 0 0 0 0 0 767 0 0 0 0 0 298 0
## 3 453 235 0 0 0 0 0 0 0 323 0 127 0 0 0
## 4 0 0 321 0 0 0 0 0 0 0 3352 2 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 7575 1 271 0 828 1 0 0 0 0 0
## 8 0 2 23 0 0 0 0 0 0 0 0 321 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
##
## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 0 462 0 0 223 280 1 35 0 206 0 0 0 0 0 0
## 1 0 283 2880 0 0 0 458 0 0 399 0 260 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 203 0 0 0 0 0 0 0
## 4 0 0 0 0 0 450 0 0 0 0 11 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 1 0 0 0 0 0 0 0 0 8 0 0 134 0 0
## 8 0 0 0 0 0 0 0 0 0 0 381 0 409 4 1003
## 9 0 0 0 0 0 0 0 0 0 0 0 0 0 864 0
##
## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
## 0 0 2211 350 641 584 458 477 0 404 403 475 978 577 779 210
## 1 0 0 79 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 78 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 3 0 0 0 0 0 0 0 32 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 8 776 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 917 0 0 0 0 0 0 0
##
## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
## 0 0 0 185 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 546 0 0 0 0 0 3122 627 0 267 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 104 0 0 578 49 203 1175 882 0 0 1409 0 0 967 728
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 8 113 0 0 0 325 0 0 0 0 0 0 0 555 0 0
## 9 0 763 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 109 0 0 530 0 0 0 0
## 2 4825 1987 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 76 0
## 4 0 0 960 1744 321 0 0 0 0 0 0 0 0 0 5
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 417 290 0 253 86 0 810 0 0
## 8 0 0 0 0 3 529 0 0 203 0 0 0 0 126 1587
## 9 0 0 0 0 0 0 0 0 0 0 0 568 0 0 0
##
## 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 1 0 0 0 0 0 0 0 0 209 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 196 0 0 0 0 0 1994 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 95 0 0 0 0 887
## 5 0 0 0 0 0 0 0 0 0 0 0 415 0 521 1093
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 241 0 0 0 0 0 0
## 8 208 961 0 0 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 206 494 2112 484 0 344 0 694 2802 0 3 0 0
##
## 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 0 0 0 0 0 0 22 2 0 0 246 2235 293
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 647 296 326 5953 0 0 0 0 0 0 0 0
## 4 0 318 477 0 0 0 0 0 0 0 0 0 0 0 0
## 5 2292 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 3157 0 0 0 0 0 0
## 8 0 0 1126 0 0 0 0 769 8 2351 260 408 0 0 0
## 9 0 0 0 0 0 0 0 0 0 1315 0 0 0 207 0
##
## 120
## 0 0
## 1 635
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
The first thing we notice is that, for each column, most of the values are 0; a 0 means that none of the documents assigned to that subtopic (column) belonged to that topic (row).
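We can also pull a single subtopic’s column out of the contingency table to see how its documents are spread across the ten topics; for example, subtopic 113 (column names are stored as character labels):
contingency_table[, "113"]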
topic: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count: [0, 22, 0, 0, 0, 0, 0, 3157, 8, 0]
So we can see that subtopic 113 has had the vast majority of its documents assigned to topic 7, with the leftovers being assigned to topics 1 & 8 with 22 and 8 documents respectively.
3.2 Qualitative Comparison of Subtopics
Investigate whether the subtopics which fit into bigger topics are indeed similar:
filtered_data %>%
filter(hdbscan_clusters %in% c(3, 5, 6),
kmeans_clusters == 5) %>%
select(text, hdbscan_clusters) %>%
DT::datatable(filter = "top")
Subtopic 3 appears to be about songs and music, with a focus on AI covers or AI versions thereof. Subtopic 5 appears to be mainly about AI-generated videos. Subtopic 6 appears to be mainly about AI voiceover artists.
Each of these subtopics fits fairly cleanly into topic 5. We might want to infer at this stage that topic 5 has to do with AI-generated content or entertainment.
Let’s check whether subtopics 0 & 1, which fit into topic 6 rather than 5, seem different to 3, 5, 6:
filtered_data %>%
filter(hdbscan_clusters %in% c(0, 1)) %>%
select(clean_text, hdbscan_clusters) %>%
DT::datatable(filter = "top")
Subtopic 0 is about Snap AI, whereas subtopic 1 appears to have been clustered on the leading digits - which are representative of a specific website’s username values. This shows the importance of properly cleaning data before topic modelling.
For example:
str_remove_all(c("122453 remove these. hello this is a demo. Keep these numbers 10282"), pattern = "^\\d+")
## [1] " remove these. hello this is a demo. Keep these numbers 10282"
Let’s try a different couple of sub topics, we’ll go with 29 and 32, which both have all their posts in topic 1:
filtered_data %>%
filter(hdbscan_clusters %in% c(29, 32)) %>%
select(text, hdbscan_clusters) %>%
DT::datatable(filter = "top")29 is about relgiion, 32 is about healthcare. They are not that similar at a glance, but they are different to the AI-generated content, that’s for sure. It looks like the subtopics are doing their job!
3.2.1 Bigram Network for Topic Understanding
The bigram network contains many references to things such as sound, video, music, art, movies etc.; this suggests that our inference may have been correct.
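A minimal sketch of how a comparable bigram network for topic 5 could be produced with tidytext, igraph and ggraph (these packages and column choices are assumptions rather than the original implementation):
library(tidytext)
library(igraph)
library(ggraph)
bigram_graph <- filtered_data %>%
  filter(kmeans_clusters == 5) %>%
  unnest_tokens(bigram, clean_text, token = "ngrams", n = 2) %>%
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!is.na(word1),
         !word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  slice_max(order_by = n, n = 50) %>%
  graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(alpha = n)) +
  geom_node_point(colour = "midnightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()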
3.2.2 Weighted Log-odds for Within Topic Comparison
We have identified that there are some subtopics which tend to fit nicely into topic 5; we could use weighted log-odds to understand their differences.
filtered_data %>%
filter(kmeans_clusters == 5) %>%
ParseR::calculate_wlos(topic_var = hdbscan_clusters, text_var = clean_text, top_terms_cutoff = 2500, nrow = 7) %>%
purrr::pluck("viz")
We can see quite clearly that our subtopics are fairly similar, but importantly different. This is a satisfying result.
3.3 Distributions of values
The contingency table works great for looking up individual data points with precision, but perhaps we can visualise the distributions in a better way.
subtopic_in_topic <- filtered_data %>%
count(hdbscan_clusters, kmeans_clusters, name = "hdb_subtotal") %>%
add_count(hdbscan_clusters, wt = hdb_subtotal, name = "hdb_total") %>%
mutate(hdb_prop = hdb_subtotal/hdb_total)
density <- subtopic_in_topic %>%
ggplot(aes(x= hdb_prop)) +
geom_density(alpha = 0.8, fill = "midnightblue", color = "white") +
theme_minimal() +
labs(x = "Proportion of HDBScan cluster in kMeans clusters")histogram <- subtopic_in_topic %>%
ggplot(aes(x= hdb_prop)) +
geom_histogram(color = "white", fill = "midnightblue") +
theme_minimal() +
labs(x = "Proportion of HDBScan cluster in kMeans clusters", )Combine the density and the histogram
library(patchwork)
density +
histogram +
plot_layout(ncol = 1) +
patchwork::plot_annotation(title = stringr::str_wrap("Without slicing for the maximum value of proportion per HDBScan cluster, the distribution is extremely dense at 1.0, i.e. all of the HDBScan cluster fits into one kmeans cluster"))
3.3.1 Highest Value Only
Slice the highest value per subtopic, as this tells us the maximum proportion of each subtopic which belongs to any one topic; arguably a better way to cut the data up to figure out how good the fit is.
subtopic_in_topic %>%
slice_max(n = 1, order_by = hdb_prop, by = hdbscan_clusters) %>%
summarise(mean = mean(hdb_prop), median = median(hdb_prop), max = max(hdb_prop), min = min(hdb_prop))
We can check out the histogram and the density for the sliced values just like we did for the unsliced:
subtopic_in_topic %>%
slice_max(order_by = hdb_prop, n = 1, by = "hdbscan_clusters") %>%
ggplot(aes(x= hdb_prop)) +
geom_histogram(color = "white", fill = "midnightblue") +
theme_minimal() +
labs(x = "Proportion of max(HDBScan cluster in kMeans cluster)") +
subtopic_in_topic %>%
slice_max(order_by = hdb_prop, n = 1, by = "hdbscan_clusters") %>%
ggplot(aes(x= hdb_prop)) +
geom_density(alpha = 0.8, fill = "midnightblue", color = "white") +
theme_minimal() +
labs(x = "Proportion of max(HDBScan cluster in kMeans cluster)") +
plot_layout(ncol = 1) +
plot_annotation(title = stringr::str_wrap("When we slice the maximum value of proportion, such that we have one value per HDBScan cluster, our distribution is (predictably) further skewed towards proportion == 1.0, or all of HDBScan cluster fits inside one kmeans cluster."))
What’s the insight?
In our sample, virtually all subtopics fit neatly into one topic. Furthermore, there is no subtopic which does not have > 50% of its documents inside a single topic.
3.4 Interpreting Fit with a Variation Matrix
If we want to get a general understanding of how our subtopics fit into topics, a heatmap or variation matrix may be more useful than a contingency table, and will complement the distributions we just looked at.
Each cell of the variation matrix will be coloured according to the % of that column’s subtopic which belongs to that row’s topic.
Given that we saw a high number of 0s in our contingency table and the density being highest at 1, we would expect to see both a lot of white (0 value) and a lot of yellow (close to 100%).
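A minimal sketch of such a variation matrix, reusing the subtopic_in_topic proportions calculated earlier (the white-to-yellow colour scale is an assumption about the original styling; subtopic/topic pairs with no documents simply render as blank, i.e. 0%):
subtopic_in_topic %>%
  ggplot(aes(x = factor(hdbscan_clusters), y = factor(kmeans_clusters), fill = hdb_prop)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "yellow", labels = scales::percent) +
  theme_minimal() +
  labs(x = "Subtopic (HDBScan cluster)",
       y = "Topic (kMeans cluster)",
       fill = "% of subtopic in topic")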
4 Understanding Relationships Between Subtopics & Topics
To better understand the intricate relationships between our topics and subtopics, we’ll visualize them using network plots.
We can transform our contingency table into a data frame, and our data frame into a graph:
library(tidygraph)
library(ggraph)
library(ggrepel)
df <- contingency_table %>%
as.data.frame() %>%
tibble() %>%
filter(Freq != 0) %>%
# rename(from = Var2, to = Var1, weight = Freq) %>%
rename(from = Var1, to = Var2, weight = Freq) %>%
mutate(from = paste0("topic_", from)) %>%
mutate(from = str_wrap(from, 10))
# Convert df to tidygraph object
(graph <- as_tbl_graph(df))
## # A tbl_graph: 131 nodes and 168 edges
## #
## # A directed acyclic simple graph with 2 components
## #
## # A tibble: 131 × 1
## name
## <chr>
## 1 topic_6
## 2 topic_1
## 3 topic_5
## 4 topic_0
## 5 topic_2
## 6 topic_9
## # ℹ 125 more rows
## #
## # A tibble: 168 × 3
## from to weight
## <int> <int> <int>
## 1 1 11 1076
## 2 1 12 4974
## 3 2 13 297
## # ℹ 165 more rows
4.1 Interactive Network
Rendering an interactive network will allow us to explore how topics are connected (they share edges with subtopics), and which subtopics are similar with respect to how they are distributed into topics (they share edges with the same topics).
Prepare the graph for visNetwork by adding some attributes in columns which visNetwork will automatically apply to our visualisation.
interactive_graph <- graph %>%
activate(nodes) %>%
mutate(
origin = if_else(name %in% df$from, 'from', 'to'),
color = ifelse(origin == "from", "black", "red"),
size = ifelse(origin == "from", 10, 5),
shape = ifelse(origin == "from", "circle", "star"),
font.color = ifelse(origin == "from", "white", "black"),
font.size = ifelse(origin == "from", 40, 20),
label = name
)
#Get edges and nodes for interactive viz
nodes_data <- as_tibble(interactive_graph %>% activate(nodes)) %>%
mutate(id = row_number(), .before = 1)
edges_data <- as_tibble(interactive_graph %>% activate(edges))
library(visNetwork)
#Customise nodes for better label visibility
nodes_data$font.size <- 20
nodes_data$font.face <- "bold"
nodes_data$shadow.enabled <- TRUE
nodes_data$shadow.size <- 10
nodes_data$shadow.x <- 2
nodes_data$shadow.y <- 2
Now we can visualise using visNetwork. We’ll set a randomSeed so the layout is similar each time we run it.
visNetwork(nodes = nodes_data, edges = edges_data) %>%
visIgraphLayout(
randomSeed = 42,
layout = "layout_nicely",
physics = TRUE,
type = "square" #,
# smooth = TRUE #saves time if not calling with this.
) %>%
visOptions(
highlightNearest =
list(enabled = TRUE, degree = 2, hover = TRUE),
selectedBy = "origin",
nodesIdSelection = TRUE) %>%
# visPhysics(solver = "barnesHut") %>%
visPhysics(solver = "forceAtlas2Based")
5 Insights and Challenges
Generally, HDBScan clusters fit tidily into kMeans clusters. This should mean that we’re able to use both of them for different analyses, according to the demands of a given research project. It should also mean that they are complementary when used together.
At first glance, the subtopics which fit into topics appear to make sense - or they seem to be interpretable.
The contingency table gets difficult to work with when the number of HDBScan clusters gets high enough (50-100+) - simply visualising that many rows or columns is tricky.
The variation matrix can come in handy when the contingency table gets too big, but it’s not made for analysing specific points. It will also be difficult to work with when the number of columns gets much higher. However, it is useful as an indication of how well the HDBScan clusters fit into the kMeans clusters (lots of yellow = they fit well).
The network plots help to identify relationships (direct and indirect), the challenge is how best to use them.
6 Automating Analysis with Functions
If we think the exploration has been useful, we may want to wrap everything up into functions to automate the analysis:
6.1 Summary statistics function
subtopic_summary_max <- function(df, subtopic_var, topic_var) {
subtopic_sym <- rlang::ensym(subtopic_var)
topic_sym <- rlang::ensym(topic_var)
#Count + proportion
data <- df %>%
dplyr::count(!!subtopic_sym, !!topic_sym, name = "subtopic_subtotal") %>%
dplyr::filter(subtopic_subtotal > 0) %>%
dplyr::add_count(!!subtopic_sym, wt = subtopic_subtotal, name = "subtopic_total") %>%
dplyr::mutate(proportion = subtopic_subtotal/subtopic_total)
message("Summary statistics counting the highest value for [proportion of subtopic in single topic]:")
#Slice & Summarise
data <- data %>%
dplyr::slice_max(n = 1, order_by = subtopic_subtotal, with_ties = FALSE, by = !!subtopic_sym) %>%
dplyr::summarise(mean = mean(proportion), median = median(proportion), min = min(proportion), max = max(proportion))
return(data)
}
filtered_data %>%
subtopic_summary_max(hdbscan_clusters, kmeans_clusters)
6.3 Interactive Topic Network Function
6.3.1 Calculate Subtopic - Topic Distribution
calculate_subtopic_in_topic <- function(df, subtopic_var, topic_var) {
subtopic_sym <- rlang::ensym(subtopic_var)
topic_sym <- rlang::ensym(topic_var)
#Subset df for the two columns we need
data <- df %>% dplyr::select(!!subtopic_sym, !!topic_sym)
#For later graphing
colnames(data) <- c(".subtopic", ".topic")
data <- data %>% dplyr::count(.subtopic, .topic, name = "weight") %>%
dplyr::filter(weight != 0) %>%
dplyr::mutate(.topic = paste0("Topic ", .topic))
return(data)
}
(dator <- calculate_subtopic_in_topic(filtered_data, hdbscan_clusters, kmeans_clusters))
6.3.2 Make Graph (taking the output of the calculate function as input)
make_subtopic_topic_graph <- function(output_of_calculate_subtopic_in_topic_function, subtopic_colour = "red", topic_colour = "black") {
data <- output_of_calculate_subtopic_in_topic_function
graph <- tidygraph::as_tbl_graph(data, directed = FALSE)
graph <- graph %>%
tidygraph::activate(nodes) %>%
dplyr::mutate(
origin = if_else(name %in% data$.subtopic, 'subtopic', 'topic'),
color = ifelse(origin == "subtopic", subtopic_colour, topic_colour),
size = ifelse(origin == "subtopic", 5, 10),
font.color = ifelse(origin == "subtopic","black", "white"),
font.size = ifelse(origin == "subtopic", 20, 40)
)
return(graph)
}
(my_graph <- make_subtopic_topic_graph(dator, subtopic_colour = "green", topic_colour = "midnightblue"))
## # A tbl_graph: 131 nodes and 168 edges
## #
## # An undirected simple graph with 2 components
## #
## # A tibble: 131 × 6
## name origin color size font.color font.size
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 0 subtopic green 5 black 20
## 2 1 subtopic green 5 black 20
## 3 2 subtopic green 5 black 20
## 4 3 subtopic green 5 black 20
## 5 4 subtopic green 5 black 20
## 6 5 subtopic green 5 black 20
## # ℹ 125 more rows
## #
## # A tibble: 168 × 3
## from to weight
## <int> <int> <int>
## 1 1 122 1076
## 2 2 122 4974
## 3 3 123 297
## # ℹ 165 more rows
6.3.3 Visualise Graph (with output of the graph maker)
viz_subtopic_topic_interactive <- function(graph, layout = "layout_nicely", hover_depth = 1) {
edges_data <- graph %>%
tidygraph::activate(edges) %>%
as.data.frame() %>%
dplyr::mutate(color = "grey30")
nodes_data <- graph %>%
tidygraph::activate(nodes) %>%
dplyr::mutate(id = dplyr::row_number(), .before = 1) %>%
dplyr::mutate(shape =
ifelse(origin == "subtopic", "star", "circle"),
label = name) %>%
as.data.frame()
#Customise nodes for better label visibility
nodes_data$font.size <- 20
nodes_data$font.face <- "bold"
nodes_data$shadow.enabled <- TRUE
nodes_data$shadow.size <- 10
nodes_data$shadow.x <- 2
nodes_data$shadow.y <- 2
network <- visNetwork::visNetwork(
nodes = nodes_data,
edges = edges_data) %>%
visNetwork::visIgraphLayout(
# randomseed = 42,
physics = TRUE,
layout = layout,
type = "square") %>%
visNetwork::visOptions(
highlightNearest =
list(
enabled = TRUE,
degree = hover_depth,
hover = TRUE),
selectedBy = "origin",
nodesIdSelection = TRUE) %>%
visNetwork::visPhysics(solver = "forceAtlas2Based")
return(network)
}
viz_subtopic_topic_interactive(my_graph, hover_depth = 1)
6.3.4 Quick Function
quick_interactive_viz <- function(df, subtopic_var, topic_var, layout = "layout_nicely", hover_depth = 1, topic_colour = "black", subtopic_colour = "red") {
subtopic_sym <- rlang::ensym(subtopic_var)
topic_sym <- rlang::ensym(topic_var)
subtopic_string <- rlang::as_string(subtopic_sym)
topic_string <- rlang::as_string(topic_sym)
calcs <- calculate_subtopic_in_topic(df, !!subtopic_sym, !!topic_sym)
graph <- make_subtopic_topic_graph(calcs, topic_colour = topic_colour, subtopic_colour = subtopic_colour)
network <- viz_subtopic_topic_interactive(graph, layout = layout, hover_depth = hover_depth)
return(network)
}
quick_interactive_viz(filtered_data, hdbscan_clusters, kmeans_clusters, topic_colour = "green", subtopic_colour = "midnightblue")
7 Further work
- A Shiny application for interactive exploration of topics and their documents (?)
- Adjusted Rand Index for comparing multiple topic models
- Tools for merging subtopics into topics
- Better summary statistics to understand the fitting process at a glance, better visualisations too
- Experimentation with more HDBScan and kMeans parameters (probably aided by the Shiny app)
- Workflow recommendation & development (integration with other tools)
- End-user outputs rather than researcher-facing ones