Overview
Our implementation of latent Dirichlet allocation with Gibbs sampling is up and running on the HP Brandwatch query. There is more cleaning of the dataset, i.e. more stopwords to include, but this serves as POC.
Overview of Brandwatch Query
Original Dataset:
- Size: >600MB
- Documents: >180,000
Since this is too large of a dataset, I only selected domains that appear more than 18 times in the query and took the top 10 most recent articles from each of the domains.
Subsetted Dataset:
- Size: >30MB
- Documents: >5,000
Comparison of Documents
library(tidytext)
library(tidyverse)
library(stringr)
library(readxl)
library(quanteda)
library(topicmodels)
library(ggplot2)
data("stop_words")
source("custom_functions.R")
load("output/hp_models.RData")
doc_tops_lda <- tidy(hp_lda, matrix = "gamma") %>%
group_by(document) %>%
top_n(1, gamma)
doc_tops_gibbs <- tidy(hp_gibbs, matrix = "gamma") %>%
group_by(document) %>%
top_n(1, gamma)
doc_tops <- bind_rows(list("LDA" = doc_tops_lda, "Gibbs" = doc_tops_gibbs), .id = "Model")
The problem we had using LDA before is that it lumped documents mostly into one topic. As you can see in the chart below, the topics are more distinct and separated when using Gibbs sampling. This makes intuitive sense to me given that we are essentially telling Brandwatch to look at one topic.
ggplot(doc_tops, aes(x=topic)) +
#geom_histogram(bins=20) +
geom_bar(aes(fill = topic)) +
#scale_fill_discrete(palette = "Blues") +
facet_grid(Model~.) +
labs(title = "Distribution of Documents into Topics by Model",
x = "Topic Number (1-20)",
y = "Number of Documents")

Top Terms by Topic
Gibbs Model
Below is a chart of top topics. Right now the most “popular” topics are 3, 11, 12, 16, and 19. But looking at these topics tells me I have to adjust the stopwords list for the next iteration.
hp_topics_gibbs <- tidy(hp_gibbs, matrix = "beta")
hp_top_terms_gibbs <- hp_topics_gibbs %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
hp_top_terms_gibbs %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
labs(title = "Gibbs: Top 10 Terms for Each Topic",
y = "term score (beta)" )

chart.name <- paste("Gibbs_","top-10-words_",topics,"-topics_",today,".png",sep = "")
ggsave(file.path("charts",chart.name), width = 14, height = 10)
LDA Model
Just for reference, here is are the top terms for each topic using basic LDA. Topic 13 is the winner here which overlaps a lot with Gibb’s Topic 3.
hp_topics_lda <- tidy(hp_lda, matrix = "beta")
hp_top_terms_lda <- hp_topics_lda %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
hp_top_terms_lda %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
labs(title = "LDA - Top 10 Terms for Each Topic",
y = "term score (beta)" )

chart.name <- paste("LDA_","top-10-words_",topics,"-topics_",today,".png",sep = "")
ggsave(file.path("charts",chart.name), width = 14, height = 10)
LS0tDQp0aXRsZTogIkhQIE9FTSAtIFRvcGljIE1vZGVsIENvbXBhcmlzb24iDQphdXRob3I6ICJBbmRlcnMgU3dhbnNvbiINCmRhdGU6ICJgciBmb3JtYXQoU3lzLnRpbWUoKSwgJyVkICVCICVZJylgIg0Kb3V0cHV0OiANCiAgaHRtbF9ub3RlYm9vazoNCiAgICBkZXB0aDogMw0KICAgIHRvYzogeWVzDQogICAgdGhlbWU6IHNwYWNlbGFiDQogICAgY29kZV9mb2xkaW5nOiBoaWRlDQotLS0NCg0KIyBPdmVydmlldw0KDQpPdXIgaW1wbGVtZW50YXRpb24gb2YgbGF0ZW50IERpcmljaGxldCBhbGxvY2F0aW9uIHdpdGggR2liYnMgc2FtcGxpbmcgaXMgdXAgYW5kIHJ1bm5pbmcgb24gdGhlIEhQIEJyYW5kd2F0Y2ggcXVlcnkuIFRoZXJlIGlzIG1vcmUgY2xlYW5pbmcgb2YgdGhlIGRhdGFzZXQsIGkuZS4gbW9yZSBzdG9wd29yZHMgdG8gaW5jbHVkZSwgYnV0IHRoaXMgc2VydmVzIGFzICBQT0MuDQoNCiMgTWV0YXBob3IgZm9yIExEQQ0KDQpbSW50cm9kdWN0aW9uIHRvIExhdGVudCBEaXJpY2hsZXQgQWxsb2NhdGlvbl0oaHR0cDovL2Jsb2cuZWNoZW4ubWUvMjAxMS8wOC8yMi9pbnRyb2R1Y3Rpb24tdG8tbGF0ZW50LWRpcmljaGxldC1hbGxvY2F0aW9uLykgKHNjcm9sbCBkb3duIHRvICdMYXltYW4ncyBFeHBsYW5hdGlvbicgc2VjdGlvbikuDQoNCiMgT3ZlcnZpZXcgb2YgQnJhbmR3YXRjaCBRdWVyeQ0KDQpPcmlnaW5hbCBEYXRhc2V0Og0KDQoqIFNpemU6ID42MDBNQg0KKiBEb2N1bWVudHM6ID4xODAsMDAwDQoNCg0KU2luY2UgdGhpcyBpcyB0b28gbGFyZ2Ugb2YgYSBkYXRhc2V0LCBJIG9ubHkgc2VsZWN0ZWQgZG9tYWlucyB0aGF0IGFwcGVhciBtb3JlIHRoYW4gMTggdGltZXMgaW4gdGhlIHF1ZXJ5IGFuZCB0b29rIHRoZSB0b3AgMTAgbW9zdCByZWNlbnQgYXJ0aWNsZXMgZnJvbSBlYWNoIG9mIHRoZSBkb21haW5zLg0KDQpTdWJzZXR0ZWQgRGF0YXNldDoNCg0KKiBTaXplOiA+MzBNQg0KKiBEb2N1bWVudHM6ID41LDAwMA0KDQoNCiMgQ29tcGFyaXNvbiBvZiBEb2N1bWVudHMNCg0KYGBge3IgbGlicmFyaWVzLCBtZXNzYWdlPUZBTFNFfQ0KbGlicmFyeSh0aWR5dGV4dCkNCmxpYnJhcnkodGlkeXZlcnNlKQ0KbGlicmFyeShzdHJpbmdyKQ0KbGlicmFyeShyZWFkeGwpDQpsaWJyYXJ5KHF1YW50ZWRhKQ0KbGlicmFyeSh0b3BpY21vZGVscykNCmxpYnJhcnkoZ2dwbG90MikNCmRhdGEoInN0b3Bfd29yZHMiKQ0Kc291cmNlKCJjdXN0b21fZnVuY3Rpb25zLlIiKQ0KdG9kYXkgPC0gU3lzLkRhdGUoKQ0KdG9waWNzIDwtIDIwDQpgYGANCg0KDQoNCmBgYHtyfQ0KbG9hZCgib3V0cHV0L2hwX21vZGVscy5SRGF0YSIpDQpgYGANCg0KDQpgYGB7cn0NCg0KZG9jX3RvcHNfbGRhIDwtIHRpZHkoaHBfbGRhLCBtYXRyaXggPSAiZ2FtbWEiKSAlPiUNCiAgICBncm91cF9ieShkb2N1bWVudCkgJT4lDQogICAgdG9wX24oMSwgZ2FtbWEpDQoNCmRvY190b3BzX2dpYmJzIDwtIHRpZHkoaHBfZ2liYnMsIG1hdHJpeCA9ICJnYW1tYSIpICU+JQ0KICAgIGdyb3VwX2J5KGRvY3VtZW50KSAlPiUNCiAgICB0b3BfbigxLCBnYW1tYSkNCg0KDQpkb2NfdG9wcyA8LSBiaW5kX3Jvd3MobGlzdCgiTERBIiA9IGRvY190b3BzX2xkYSwgIkdpYmJzIiA9IGRvY190b3BzX2dpYmJzKSwgLmlkID0gIk1vZGVsIikNCg0KDQoNCmBgYA0KDQpUaGUgcHJvYmxlbSB3ZSBoYWQgdXNpbmcgTERBIGJlZm9yZSBpcyB0aGF0IGl0IGx1bXBlZCBkb2N1bWVudHMgbW9zdGx5IGludG8gb25lIHRvcGljLiBBcyB5b3UgY2FuIHNlZSBpbiB0aGUgY2hhcnQgYmVsb3csIHRoZSB0b3BpY3MgYXJlIG1vcmUgZGlzdGluY3QgYW5kIHNlcGFyYXRlZCB3aGVuIHVzaW5nIEdpYmJzIHNhbXBsaW5nLiBUaGlzIG1ha2VzIGludHVpdGl2ZSBzZW5zZSB0byBtZSBnaXZlbiB0aGF0IHdlIGFyZSBlc3NlbnRpYWxseSB0ZWxsaW5nIEJyYW5kd2F0Y2ggdG8gbG9vayBhdCBvbmUgdG9waWMuDQoNCmBgYHtyfQ0KDQpnZ3Bsb3QoZG9jX3RvcHMsIGFlcyh4PXRvcGljKSkgKw0KICAgICNnZW9tX2hpc3RvZ3JhbShiaW5zPTIwKSArDQogICAgZ2VvbV9iYXIoYWVzKGZpbGwgPSB0b3BpYykpICsNCiAgICAjc2NhbGVfZmlsbF9kaXNjcmV0ZShwYWxldHRlID0gIkJsdWVzIikgKw0KICAgIGZhY2V0X2dyaWQoTW9kZWx+LikgKw0KICAgIGxhYnModGl0bGUgPSAiRGlzdHJpYnV0aW9uIG9mIERvY3VtZW50cyBpbnRvIFRvcGljcyBieSBNb2RlbCIsDQogICAgICAgICB4ID0gIlRvcGljIE51bWJlciAoMS0yMCkiLA0KICAgICAgICAgeSA9ICJOdW1iZXIgb2YgRG9jdW1lbnRzIikNCg0KYGBgDQoNCg0KIyBUb3AgVGVybXMgYnkgVG9waWMNCg0KIyMgR2liYnMgTW9kZWwNCg0KQmVsb3cgaXMgYSBjaGFydCBvZiB0b3AgdG9waWNzLiBSaWdodCBub3cgdGhlIG1vc3QgInBvcHVsYXIiIHRvcGljcyBhcmUgMywgMTEsIDEyLCAxNiwgYW5kIDE5LiBCdXQgbG9va2luZyBhdCB0aGVzZSB0b3BpY3MgdGVsbHMgbWUgSSBoYXZlIHRvIGFkanVzdCB0aGUgc3RvcHdvcmRzIGxpc3QgZm9yIHRoZSBuZXh0IGl0ZXJhdGlvbi4NCg0KYGBge3IsIGZpZy53aWR0aD0xNCxmaWcuaGVpZ2h0PTEwfQ0KaHBfdG9waWNzX2dpYmJzIDwtIHRpZHkoaHBfZ2liYnMsIG1hdHJpeCA9ICJiZXRhIikNCg0KaHBfdG9wX3Rlcm1zX2dpYmJzIDwtIGhwX3RvcGljc19naWJicyAlPiUNCiAgZ3JvdXBfYnkodG9waWMpICU+JQ0KICB0b3BfbigxMCwgYmV0YSkgJT4lDQogIHVuZ3JvdXAoKSAlPiUNCiAgYXJyYW5nZSh0b3BpYywgLWJldGEpDQoNCmhwX3RvcF90ZXJtc19naWJicyAlPiUNCiAgbXV0YXRlKHRlcm0gPSByZW9yZGVyKHRlcm0sIGJldGEpKSAlPiUNCiAgZ2dwbG90KGFlcyh0ZXJtLCBiZXRhLCBmaWxsID0gZmFjdG9yKHRvcGljKSkpICsNCiAgZ2VvbV9jb2woc2hvdy5sZWdlbmQgPSBGQUxTRSkgKw0KICBmYWNldF93cmFwKH4gdG9waWMsIHNjYWxlcyA9ICJmcmVlIikgKw0KICBjb29yZF9mbGlwKCkgKw0KICBsYWJzKHRpdGxlID0gIkdpYmJzOiBUb3AgMTAgVGVybXMgZm9yIEVhY2ggVG9waWMiLA0KICAgICAgIHkgPSAidGVybSBzY29yZSAoYmV0YSkiICkNCg0KY2hhcnQubmFtZSA8LSBwYXN0ZSgiR2liYnNfIiwidG9wLTEwLXdvcmRzXyIsdG9waWNzLCItdG9waWNzXyIsdG9kYXksIi5wbmciLHNlcCA9ICIiKQ0KDQpnZ3NhdmUoZmlsZS5wYXRoKCJjaGFydHMiLGNoYXJ0Lm5hbWUpLCB3aWR0aCA9IDE0LCBoZWlnaHQgPSAxMCkNCmBgYA0KDQojIyBMREEgTW9kZWwNCg0KSnVzdCBmb3IgcmVmZXJlbmNlLCBoZXJlIGlzIGFyZSB0aGUgdG9wIHRlcm1zIGZvciBlYWNoIHRvcGljIHVzaW5nIGJhc2ljIExEQS4gVG9waWMgMTMgaXMgdGhlIHdpbm5lciBoZXJlIHdoaWNoIG92ZXJsYXBzIGEgbG90IHdpdGggR2liYidzIFRvcGljIDMuDQoNCmBgYHtyLCBmaWcud2lkdGg9MTQsZmlnLmhlaWdodD0xMH0NCmhwX3RvcGljc19sZGEgPC0gdGlkeShocF9sZGEsIG1hdHJpeCA9ICJiZXRhIikNCg0KaHBfdG9wX3Rlcm1zX2xkYSA8LSBocF90b3BpY3NfbGRhICU+JQ0KICBncm91cF9ieSh0b3BpYykgJT4lDQogIHRvcF9uKDEwLCBiZXRhKSAlPiUNCiAgdW5ncm91cCgpICU+JQ0KICBhcnJhbmdlKHRvcGljLCAtYmV0YSkNCg0KaHBfdG9wX3Rlcm1zX2xkYSAlPiUNCiAgbXV0YXRlKHRlcm0gPSByZW9yZGVyKHRlcm0sIGJldGEpKSAlPiUNCiAgZ2dwbG90KGFlcyh0ZXJtLCBiZXRhLCBmaWxsID0gZmFjdG9yKHRvcGljKSkpICsNCiAgZ2VvbV9jb2woc2hvdy5sZWdlbmQgPSBGQUxTRSkgKw0KICBmYWNldF93cmFwKH4gdG9waWMsIHNjYWxlcyA9ICJmcmVlIikgKw0KICBjb29yZF9mbGlwKCkgKw0KICBsYWJzKHRpdGxlID0gIkxEQSAtIFRvcCAxMCBUZXJtcyBmb3IgRWFjaCBUb3BpYyIsDQogICAgICAgeSA9ICJ0ZXJtIHNjb3JlIChiZXRhKSIgKQ0KDQpjaGFydC5uYW1lIDwtIHBhc3RlKCJMREFfIiwidG9wLTEwLXdvcmRzXyIsdG9waWNzLCItdG9waWNzXyIsdG9kYXksIi5wbmciLHNlcCA9ICIiKQ0KDQpnZ3NhdmUoZmlsZS5wYXRoKCJjaGFydHMiLGNoYXJ0Lm5hbWUpLCB3aWR0aCA9IDE0LCBoZWlnaHQgPSAxMCkNCmBgYA0KDQoNCg0KDQoNCg0K