In my previous independent analysis, I analyzed the sentiment of messages that businesses posted to their Yelp profiles during peak COVID-19 lockdowns. The data came from "covid banners," special spaces on a business's page that could be used for anything the business deemed relevant to how it was responding to pandemic restrictions. I soon found that such a completely open-ended "prompt" produced wide variance in sentiment, because the communications themselves varied widely. In my sample, some businesses used the opportunity to simply post updated operating hours, while others offered more personal messages to their customers about the trying times, among many other kinds of messages. It was, more or less, like the sheets of paper you sometimes see taped to storefronts in real life, offering temporary messages to the public about (hopefully) short-term changes. Choosing this data for a topic modeling exercise seemed like the next logical step.
For this assignment, I wanted to identify latent topics across this wide range of banner data using unsupervised topic modeling. To do this, I ran and compared latent Dirichlet allocation (LDA) and structural topic modeling (STM) analyses, using a document-level covariate for the latter.
I first installed the usual libraries that we’ve encountered so far for topic modeling:
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("tidytext", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("readxl", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("SnowballC", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("topicmodels", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("stm", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("ldatuning", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("knitr", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
install.packages("LDAvis", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/lj/3gxxzjkj4tg_hkk4dm99lpww0000gn/T//RtmpZTyEsi/downloaded_packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(readxl)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
I selected and imported my raw data from the Yelp database, keeping each business's unique ID, the content of its banner, and a dummy variable for whether the business enabled a "call to action" option on its Yelp page, and I drew a random sample of 300 banners. The dummy variable would serve as the covariate for my structural topic model.
banners_raw <- read_xlsx("data/yelp-covid-dataset-banners.xlsx") |>
  select(business_id, covid_banner, call_to_action_enabled) |>
  sample_n(300)
I first tokenized the text of each banner, filtering out the usual stop_words:
banners_tokenized <- banners_raw |>
unnest_tokens(output = word,
input = covid_banner) |>
anti_join(stop_words, by = "word")
After looking at the top terms, I also struck some very common words and URL artifacts:
count(banners_tokenized, word, sort = TRUE)
## # A tibble: 1,867 × 2
## word n
## <chr> <int>
## 1 customers 152
## 2 safety 118
## 3 19 98
## 4 covid 97
## 5 hours 93
## 6 store 85
## 7 delivery 81
## 8 online 75
## 9 health 73
## 10 safe 72
## # … with 1,857 more rows
banners_tokenized <- banners_tokenized |>
  filter(!word %in% c("covid",
                      "19",
                      "https"))
My data was now ready to cast to a document-term matrix (DTM) for the LDA.
banners_dtm <- banners_tokenized |>
count(business_id, word) |>
cast_dtm(business_id, word, n)
I also prepared the text for my STM with textProcessor, storing the result in an object called temp for later extraction. I chose to stem the words after seeing how many similar words showed up in my LDA. You'll see what I mean soon.
temp <- textProcessor(banners_raw$covid_banner,
metadata = banners_raw,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
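One step I skipped here, but which the stm workflow usually includes, is pruning infrequent terms with prepDocuments() before fitting. A minimal sketch of what that would look like (not something I ran for the models below):
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
# out$documents, out$vocab, and out$meta would then stand in for the raw temp$ pieces used later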
Using FindTopicsNumber to search for a preferable number of topics, I identified 15 as the best-scoring option:
k_metrics <- FindTopicsNumber(
banners_dtm,
topics = seq(5, 20, by = 1),
metrics = "Griffiths2004",
method = "Gibbs",
control = list(),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL)
FindTopicsNumber_plot(k_metrics)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
I pulled a random number (from 1:1000) for the seed and ran an LDA looking for 15 topics:
banners_lda <- LDA(banners_dtm,
k = 15,
control = list(seed = 524))
With a quick look at the betas, I see a lot of repeating terms. This echoes Chelsey Hill's observation that "with LDA, topics tend to lack exclusivity–meaning that the top terms tend to be the same in more than one topic." But we'll see how it looks once I've worked on the visualization a bit more.
terms(banners_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "customers" "9am" "support" "customers" "due" "wait"
## [2,] "store" "free" "safety" "health" "delivery" "website"
## [3,] "taking" "stores" "covidâ" "safety" "closed" "time"
## [4,] "service" "online" "operations" "hours" "ongoing" "store"
## [5,] "social" "9pm" "travel" "contact" "measures" "clips"
## Topic 7 Topic 8 Topic 9 Topic 10 Topic 11 Topic 12
## [1,] "cdc" "customers" "delivery" "store" "visit" "health"
## [2,] "distancing" "store" "hours" "stores" "safety" "safety"
## [3,] "social" "19" "phone" "customers" "safe" "continue"
## [4,] "call" "covid" "online" "reopen" "local" "customers"
## [5,] "times" "employees" "call" "service" "website" "time"
## Topic 13 Topic 14 Topic 15
## [1,] "curbside" "salon" "safety"
## [2,] "pickup" "appointment" "health"
## [3,] "contactless" "wear" "social"
## [4,] "food" "safe" "cleaning"
## [5,] "1" "clients" "safe"
I made the LDA into a tidy object and visualized it with ggplot:
tidy_lda <- tidy(banners_lda)
top_terms_lda <- tidy_lda |>
group_by(topic) |>
slice_max(beta, n = 5, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
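I loaded LDAvis above but never ended up using it. For completeness, here is a minimal sketch of how the fitted model could be handed to it, using topicmodels' posterior() accessor and the slam helpers that back the DTM class; none of this is part of the analysis itself:
post <- posterior(banners_lda)                     # topic-term and document-topic probabilities
json <- createJSON(phi = post$terms,
                   theta = post$topics,
                   vocab = colnames(post$terms),
                   doc.length = slam::row_sums(banners_dtm),
                   term.frequency = slam::col_sums(banners_dtm))
# serVis(json)   # opens the interactive LDAvis view in a browser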
Just to see how it might change things, I removed "customers" and "safety," which seemed to come up most often across the topics:
banners_tokenized_2 <- banners_tokenized |>
  filter(!word %in% c("safety", "customers"))
banners_dtm_2 <- banners_tokenized_2 |>
  count(business_id, word) |>
  cast_dtm(business_id, word, n)
banners_lda_2 <- LDA(banners_dtm_2,
k = 13,
control = list(seed = 524))
tidy_lda_2 <- tidy(banners_lda_2)
top_terms_lda_2 <- tidy_lda_2 |>
group_by(topic) |>
slice_max(beta, n = 5, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda_2 |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
It looks like some word stemming could be useful in this model. After taking "community" out, "communities" has still stuck around (granted, across fewer topics). Additionally, I see some artifacts from my sample that, in a future project, I would filter out during wrangling, such as the potentially French words and letters, that garbled version of "covid," and others.
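Since SnowballC is already loaded, one tidy-side fix would be to stem the tokens before casting the DTM. A minimal sketch, again not something I ran for the models here:
banners_tokenized_stemmed <- banners_tokenized_2 |>
  mutate(word = wordStem(word, language = "english"))  # "community" and "communities" collapse to one stem
banners_dtm_stemmed <- banners_tokenized_stemmed |>
  count(business_id, word) |>
  cast_dtm(business_id, word, n)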
For comparison, I also ran an LDA with K = 30, rather than the optimal number (according to FindTopicsNumber):
banners_lda_3 <- LDA(banners_dtm_2,
k = 30,
control = list(seed = 524))
tidy_lda_3 <- tidy(banners_lda_3)
top_terms_lda_3 <- tidy_lda_3 |>
group_by(topic) |>
slice_max(beta, n = 5, with_ties = FALSE) |>
ungroup() |>
arrange(topic, -beta)
top_terms_lda_3 |>
mutate(term = reorder_within(term, beta, topic)) |>
group_by(topic, term) |>
arrange(desc(beta)) |>
ungroup() |>
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
While there was more differentiation among specific terms, there were also many more terms that overlapped across multiple topics. I can see how coherence could be lost as K increases.
To build my STM, I needed to extract three objects from temp: meta, vocab, and docs:
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
With those ready, I built my STM with call_to_action_enabled as my covariate, which would ostensibly produce topics that better fit the data. I also kept K at 15 to stay consistent with the LDA analysis.
banners_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
prevalence =~ call_to_action_enabled,
K=15,
max.em.its=25,
verbose = FALSE)
plot.STM(banners_stm, n = 5)
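The prevalence covariate also invites a step I didn't take: estimating how call_to_action_enabled actually shifts topic proportions. A hedged sketch using stm's estimateEffect, assuming the dummy is coded 0/1:
effects <- estimateEffect(1:15 ~ call_to_action_enabled,
                          banners_stm,
                          metadata = meta)
summary(effects, topics = 10)   # e.g., the "call for business" topic discussed below
# plot(effects, covariate = "call_to_action_enabled", method = "difference",
#      cov.value1 = 1, cov.value2 = 0)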
I found it very gratifying to see how the topics pulled together clusters of words that reflected my own casual labels for the various "purposes" of a Yelp covid banner. By "reading the tea leaves" here, I see some common topics across both the LDA and STM models: giving business-hour updates, describing business-wide health and safety measures, offering drive-through or curbside pickup options, and giving assurances of safety and comfort, among others.
In this particular instance, I can see the case for using stemming. In the STM model, which used stemming, I saw much less overlap and co-occurrence of terms, since they were collapsed down to their basic units. It was also nice that the STM plot ranks the topics by how prevalent they are across the corpus. Seeing topic 10, which could be broadly described as a "call for business," come out on top makes a lot of sense to me.
I feel like some of the human touch is lost in these models, though. It doesn't seem like the more personal or emotional covid banners would show up across these topics. Perhaps that's due to my inexperience in tuning and interpreting the model outputs, but I also imagine that the "bag of words" approach means losing some of the fidelity that grammar and syntax would otherwise carry.
I noticed that “vegas” showed up in my LDA model, which is pretty neat considering that the locations of the businesses in this dataset have been broadly deidentified. In their FAQ, Yelp acknowledged that their data was “centered on Montreal, Calgary, Toronto, Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland.” So that’s one of them!
This data source is undeniably rich, and I have a feeling that I’ll be able to use it again for future independent analyses. I look forward to becoming better at sampling and polishing this data in addition to learning more analytic methods down the road.