We’re back at it again with the Yelp COVID dataset! Expanding our tokenization to include bigrams and examining the affordances of text networks made my existing set of banner text a natural choice for this independent analysis. As a quick recap, the main target of my inquiry has been the “COVID Banner”, a special open-ended text area located at the top of Yelp business pages during lockdown that businesses could use to share additional information as they, and the rest of mainstream society, changed in response to the pandemic. It has been clear so far that this open-form text area can be, and has been, used for many different purposes, ranging from straightforward quality-of-life updates (new operating hours, advertising curbside pickup, etc.) to more personal messages to clients and customers.
Now armed with bigrams and text networks, I again asked the following research question: What were the most common uses for the COVID banner function on Yelp Business pages during the initial 2020 lockdown?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.2.0 ✓ forcats 0.5.1
## ✓ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr)
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
library(readxl)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
I drew from the same .xlsx file as before, pulling only the business ID and banner columns (and setting a seed for consistency this time). I also drew a sample of 1000, rather than the previous 300.
set.seed(588)
banners_raw <- read_xlsx("data/yelp-covid-dataset-banners.xlsx") |>
  select(business_id, covid_banner) |>
  sample_n(1000)
banners_bigrams <- banners_raw |>
  unnest_tokens(bigram, covid_banner, token = "ngrams", n = 2)
I looked at the top terms to see how things turned out.
banners_bigrams |>
  count(bigram, sort = TRUE)
## # A tibble: 17,877 × 2
## bigram n
## <chr> <int>
## 1 we are 654
## 2 covid 19 333
## 3 of our 262
## 4 our customers 195
## 5 are open 163
## 6 we have 160
## 7 open for 146
## 8 we will 143
## 9 safety of 135
## 10 curbside pickup 131
## # … with 17,867 more rows
It looks like a number of these were diluting the good stuff, so I pulled out stop words. I decided to keep “covid 19”, however, as it might also help provide a nucleation site for other, more interesting terms.
To pull out the stop words, I separated each bigram into two word columns, filtered both columns to remove stop words, and then united them again. I would go on to use bigrams_filtered for the text network and bigrams_united for an LDA.
#separate bigrams
bigrams_separated <- banners_bigrams |>
  separate(bigram, c("word1", "word2"), sep = " ")

#filter bigrams
bigrams_filtered <- bigrams_separated |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word)

#reunite bigrams
bigrams_united <- bigrams_filtered |>
  unite(bigram, word1, word2, sep = " ")
I also double-checked to see what the top bigrams looked like after the filter:
bigrams_counts <- bigrams_filtered |>
  count(word1, word2, sort = TRUE)
bigrams_counts
## # A tibble: 4,859 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 covid 19 333
## 2 curbside pickup 131
## 3 social distancing 85
## 4 9am 9pm 52
## 5 9pm weekdays 52
## 6 safety measures 51
## 7 top priority 43
## 8 free shipping 39
## 9 shop online 39
## 10 stay home 38
## # … with 4,849 more rows
That’s a lot of “covid 19”, but actually fewer than I expected! I guess some businesses thought it would go without saying. I also noticed a few operating hours posted, but decided to keep them to see how they would show up (I made a point to remove them from the LDA in my last independent analysis, but thought it would be interesting to see how they surface in a text network or LDA this time).
For my text network, I first needed to turn these dataframes into graphable objects.
bigrams_graph <- bigrams_counts |>
  graph_from_data_frame()
bigrams_graph
## IGRAPH 76785d5 DN-- 2873 4859 --
## + attr: name (v/c), n (e/n)
## + edges from 76785d5 (vertex names):
## [1] covid ->19 curbside ->pickup social ->distancing
## [4] 9am ->9pm 9pm ->weekdays safety ->measures
## [7] top ->priority free ->shipping shop ->online
## [10] stay ->home temporarily->closed updated ->hours
## [13] enhanced ->safety offering ->curbside challenging->times
## [16] stay ->safe store ->hours customers ->employees
## [19] operating ->hours everyone's ->safety day ->delivery
## [22] public ->health essential ->business extra ->cleaning
## + ... omitted several edges
I filtered the data to only include bigrams that occurred more than once.
bigrams_graph_filtered <- bigrams_counts |>
  filter(n > 1) |>
  graph_from_data_frame()
And here is the resulting graph:
set.seed(588)
a <- grid::arrow(type = "open", length = unit(.2, "inches"))

ggraph(bigrams_graph_filtered, layout = "kk") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
It produced quite a hairball! I thought about drawing a smaller sample, but instead ran a few iterations of the graph with higher minimum occurrence thresholds. After setting n > 10 and making a few visual tweaks, I was able to produce this instead:
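For anyone wanting to reproduce something similar, a rough sketch of that higher-threshold version is below; the specific visual tweaks (sizing nodes by degree and switching to a force-directed layout) are just one reasonable set of choices, not necessarily the exact ones behind my figure:
set.seed(588)

#keep only bigrams that occur more than 10 times
bigrams_graph_min10 <- bigrams_counts |>
  filter(n > 10) |>
  graph_from_data_frame()

#size each node by its degree so the hubs stand out
node_deg <- degree(bigrams_graph_min10)

ggraph(bigrams_graph_min10, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(aes(size = node_deg), color = "red", show.legend = FALSE) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()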
I ran an LDA similar to my previous independent analysis, with a few specific changes:
I used bigrams instead of unigrams
I used a larger sample size than before
I set a seed when sampling from the full dataset, so the text will have some variance from my previous analysis but will be consistent from here on out
I did not create any custom stop words when tidying, so that I could more directly compare the text network with the LDA
#create DTM
bigrams_dtm <- bigrams_united |>
  count(business_id, bigram) |>
  cast_dtm(business_id, bigram, n)
#establish a good k
k_metrics <- FindTopicsNumber(
  bigrams_dtm,
  topics = seq(5, 20, by = 1),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL)
FindTopicsNumber_plot(k_metrics)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
#run LDA
bigrams_lda <- LDA(bigrams_dtm, k = 20, control = list(seed = 588))

#explore betas
terms(bigrams_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "sport clips" "covid 19" "covid 19" "covid 19"
## [2,] "operating hours" "business hours" "infection control" "7 p.m"
## [3,] "business visit" "contact delivery" "social distancing" "grooming salons"
## [4,] "checking inâ" "regular business" "ubereats doordash" "march 21"
## [5,] "clean certified" "call 702" "urgent care" "10 a.m"
## Topic 5 Topic 6 Topic 7
## [1,] "top priority" "extra precautions" "safe salon"
## [2,] "additional safety" "covid 19" "store locations"
## [3,] "covid 19" "taking extra" "salon commitment"
## [4,] "bookings implementing" "contact contactless" "curbside service"
## [5,] "cancellation fees" "contactless erentals" "free shipping"
## Topic 8 Topic 9 Topic 10
## [1,] "9am 9pm" "cdc guidelines" "safety measures"
## [2,] "9pm weekdays" "accepting appointments" "enhanced safety"
## [3,] "covid 19" "stay safe" "challenging times"
## [4,] "free shipping" "covid 19" "covid 19"
## [5,] "safely pick" "eye care" "curbside pickup"
## Topic 11 Topic 12 Topic 13
## [1,] "covid 19" "health officials" "public health"
## [2,] "virtual tours" "date information" "covid 19"
## [3,] "closely monitoring" "local health" "store hours"
## [4,] "social distancing" "operating hours" "discount tire"
## [5,] "distancing guidelines" "follow local" "essential services"
## Topic 14 Topic 15 Topic 16
## [1,] "essential business" "curbside pickup" "covid 19"
## [2,] "social distancing" "offer curbside" "temporarily closed"
## [3,] "7 days" "1 priority.some" "19 situation"
## [4,] "curbside pick" "accommodate contactless" "care deeply"
## [5,] "delivery services" "card online" "customers employees"
## Topic 17 Topic 18 Topic 19
## [1,] "covid 19" "covid 19" "covid 19"
## [2,] "contactless delivery" "precautionary measures" "closed due"
## [3,] "contact free" "ongoing precautionary" "customers employees"
## [4,] "office hours" "patient service" "updated hours"
## [5,] "bread app" "measures learn" "allowing customers"
## Topic 20
## [1,] "covid 19"
## [2,] "store hours"
## [3,] "6 p.m"
## [4,] "convenient store"
## [5,] "deliver curbside"
#tidy the lda
tidy_lda <- tidy(bigrams_lda)
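From here, the top terms per topic can be pulled out and plotted with the usual tidy-LDA recipe; a sketch of that approach (my exact plot styling may have differed slightly):
#take the top five betas per topic and facet the bar charts
top_terms <- tidy_lda |>
  group_by(topic) |>
  slice_max(beta, n = 5) |>
  ungroup() |>
  arrange(topic, -beta)

top_terms |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()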
This resulted in the following visualization for the top terms:
Looking at the text network versus the LDA, I was able to tease out some of the differences and relative strengths and weaknesses of the modeling methods. Let’s look again at the text network:
This network shows a central component that largely contains the “quality of life” terminology I had been seeing in previous analyses, covering topics like updated hours, additional safety measures, greater flexibility for delivery and curbside pickup, and other accommodations that reflect adaptations to the pandemic (and, I presume, keep the businesses looking relevant and attractive to customers). Some of the other components and, for lack of a better phrase, “bigram isolates” also reflect this, but must not have shared terms with the main component frequently enough to indicate a tie. Other isolates used the more personal terms I mentioned, most notably “challenging” and “times”. Others came from specific brands that were pulled into the sample, like “pizza” and “hut”, or terms containing “.com” that got past the stop words filter. This suggests I might benefit from going through the original dataset ahead of time and removing brand-related stop words, along with other artifacts. Overall, though, this gave me a much more concise visual of how most of these COVID banners were utilized across a wide range of businesses, and of the value that bigrams can add to such analyses.
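If I do go that route, a small custom stop-word table joined against both word columns would probably be enough; the brand terms below are just placeholders based on what surfaced in this sample:
#hypothetical custom stop words for brand names, plus a check for web artifacts
custom_stops <- tibble(word = c("pizza", "hut", "sport", "clips"))

bigrams_filtered_custom <- bigrams_separated |>
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) |>
  filter(!word1 %in% custom_stops$word, !word2 %in% custom_stops$word) |>
  filter(!str_detect(word1, fixed(".com")), !str_detect(word2, fixed(".com")))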
Looking at the LDA’s proposed topics, I could see how it corroborated the findings in the text network. Unsurprisingly, “covid 19” showed up in many of the topics, which helped me realize that seeing “covid” as a central node in a text network is much more useful than seeing it take up one of the five top terms in this visualization. It makes me wonder whether it is better practice to keep a tokenized dataset exactly the same for an LDA and a text network, or to filter and otherwise wrangle them differently to play to the strengths of each method. Either way, if I were to run a bigram LDA again, I would remove “covid” and “19” to open up more information for each topic here. That said, you can still see distinct topics present that reflect many of the clusters in the text network: some refer to changed business hours, others discuss national and local policy changes, and so on. It is ultimately helpful to see a tidy grid of distinct topics to consider.
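For that hypothetical re-run, dropping any bigram containing “covid” or “19” just before building the DTM would be the simplest route; a minimal sketch:
#drop bigrams containing "covid" or "19" before casting the DTM
bigrams_dtm_nocovid <- bigrams_united |>
  filter(!str_detect(bigram, "\\bcovid\\b|\\b19\\b")) |>
  count(business_id, bigram) |>
  cast_dtm(business_id, bigram, n)

bigrams_lda_nocovid <- LDA(bigrams_dtm_nocovid, k = 20, control = list(seed = 588))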
My conclusion from this analysis is that a majority of businesses within this sample used their COVID banner to call attention to what they considered the most salient changes they had made to adapt to the pandemic, with a smaller percentage using it for branding or personal appeals to customers. The LDA offers largely distinct types of the “quality of life” changes that were highlighted in the banners.
In a future analysis, I would like to do several things:
Try performing a similar analysis with trigrams (a minimal tokenization sketch follows this list)
Render a bigram or trigram text network in 3D and interactively, visualizing with a lower threshold (e.g. n>3) but allowing for greater end-user exploration
Find more meaningful and clear 2D depictions of text networks
Combine sentiment analysis with a text network, possibly visualizing ties to show how topics might relate via positive or negative sentiment
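On the first point, trigrams are a one-argument change to the existing tokenization pipeline; a minimal sketch, reusing the same stop-word filtering pattern as the bigram version:
#tokenize the banners into trigrams and filter stop words from all three positions
banners_trigrams <- banners_raw |>
  unnest_tokens(trigram, covid_banner, token = "ngrams", n = 3) |>
  separate(trigram, c("word1", "word2", "word3"), sep = " ") |>
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) |>
  count(word1, word2, word3, sort = TRUE)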