Looking through Yelp’s free datasets, I recently came across a special dataset that they released specifically to show how some businesses adapted their Yelp pages in response to the COVID-19 pandemic. Variables included a deidentified ID for each business, several boolean indicators for whether or not a business used a specific COVID or social distancing measure (e.g. Grubhub options, a “temporarily closed” indicator, etc.), and an optional customizable PR statement which they called a “COVID banner.” That banner became the focus of my sentiment analysis. I was curious about how businesses chose to walk the line between maintaining positive relationships with their customers and acknowledging COVID-19 and its impact on society.
I developed three research questions for the purposes of this analysis:
What are the most common unigrams used in these businesses’ COVID banners?
What was the overall sentiment of these COVID banners?
For the businesses with the most negative and positive sentiment, were there any noticeable distinctions in how they used the other COVID/social distancing options available through Yelp?
I installed and loaded the following packages for my analysis:
install.packages(c("dplyr","readr","tidyr","writexl","readxl","tidytext","textdata","ggplot2","scales","wordcloud2"))
## Installing packages into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyr)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(wordcloud2)
I then loaded my dataframe from an .xlsx file. In full disclosure, this file had to be preprocessed a bit outside of R: it was first unzipped, then converted from a .json file into a .txt file, then imported into an .xlsx file. Due to the file size, I also filtered out any businesses that did not include a COVID banner.
yelp_covid_banners <- read_xlsx("yelp-covid-dataset-banners.xlsx")
To isolate the banners for tokenization, I selected only the business ID and COVID banner columns.
yelp_covid_banners_tidy <-
  yelp_covid_banners %>%
  select(business_id, covid_banner)
Due to the sheer number of entries (even after filtering out businesses that didn’t have banners), I decided to pull a random sample of 100 businesses.
banners_sample <- sample_n(yelp_covid_banners_tidy,100)
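Because sample_n() draws a different set of rows on every run, the results of this analysis change each time the document is knitted. A minimal sketch of how the draw could be made repeatable is shown here; the toy data frame and the seed value 2021 are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for yelp_covid_banners_tidy (hypothetical IDs and banner text)
toy_banners <- data.frame(
  business_id  = paste0("biz_", 1:500),
  covid_banner = paste("Banner text for business", 1:500)
)

# Fixing the seed makes the random draw repeatable; 2021 is an arbitrary choice
set.seed(2021)
sample_a <- sample_n(toy_banners, 100)

# Resetting the same seed reproduces the same 100 rows
set.seed(2021)
sample_b <- sample_n(toy_banners, 100)

identical(sample_a, sample_b)
```

Calling set.seed() once, directly before the sampling step, would be enough in the real script.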
I tokenized the resulting banner samples. I chose to use unigrams for this analysis.
banners_tokenized <-
  unnest_tokens(banners_sample,
                output = word,
                input = covid_banner)
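To make the tokenization step concrete, here is a small sketch of what unnest_tokens() does to a single banner. The banner text and business ID are made up for illustration. By default the function splits on words, lowercases them, and strips punctuation.

```r
library(tidytext)

# One hypothetical banner, standing in for a row of banners_sample
toy <- data.frame(
  business_id  = "biz_1",
  covid_banner = "We are open for delivery. Stay safe!"
)

# One row per word; the business_id is repeated for every token
toy_tokens <- unnest_tokens(toy, output = word, input = covid_banner)
toy_tokens$word
```

Each business therefore contributes as many rows as its banner has words, which is why longer banners carry more weight in the token counts.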
I then removed stop words and checked whether any additional terms needed to be added to the stop list. The standard stop_words list handled most of them, but after visualizing the tokens in a word cloud I found a few additional terms to omit, mostly stray numbers and URL fragments.
banners_tokenized_2 <- anti_join(banners_tokenized,
                                 stop_words,
                                 by = "word")
banners_tokenized_2 %>%
  count(word, sort = TRUE)
## # A tibble: 1,028 × 2
## word n
## <chr> <int>
## 1 safety 49
## 2 health 36
## 3 safe 36
## 4 19 34
## 5 covid 33
## 6 customers 29
## 7 delivery 27
## 8 visit 24
## 9 hours 19
## 10 priority 18
## # … with 1,018 more rows
# Numeric fragments and URL debris spotted in the word cloud
my_stopwords <- c("19", "30", "7", "3", "https", "2", "8", "480", "24", "6", "4", "9", "1")
banners_tokenized_2 <-
  banners_tokenized_2 %>%
  filter(!word %in% my_stopwords)
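A hand-built list of numbers only covers the numbers that happened to appear in this particular sample. As an alternative sketch (not the approach used above), purely numeric tokens could be dropped with a regular expression, keeping the manual list for non-numeric debris only. The toy tokens and the name extra_stopwords are hypothetical.

```r
library(dplyr)

# Toy tokens standing in for banners_tokenized_2
toy_tokens <- data.frame(
  word = c("safety", "19", "covid", "480", "https", "delivery", "7")
)

# Non-numeric debris still needs a manual list
extra_stopwords <- c("https")

# "^[0-9]+$" matches tokens made up entirely of digits
cleaned <- toy_tokens %>%
  filter(!grepl("^[0-9]+$", word), !word %in% extra_stopwords)

cleaned$word
```

The trade-off is that this also removes numbers that might be meaningful in context, such as prices or opening hours.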
I chose to use two sentiment lexicons for this project: AFINN and NRC. The AFINN lexicon assigns each word a positive or negative numeric score, which let me compute a net score per business and sort the businesses to answer my third research question. The NRC lexicon assigns words to emotion categories, which would provide some insight into other kinds of sentiment that might be found across this sample of banners.
afinn <- get_sentiments("afinn")
sentiment_afinn <- inner_join(banners_tokenized_2, afinn, by = "word")
# Net AFINN score per business: sum of the scores of its matched words
summary_afinn <- sentiment_afinn %>%
  group_by(business_id) %>%
  summarise(sentiment = sum(value))
summary_afinn
## # A tibble: 71 × 2
## business_id sentiment
## <chr> <dbl>
## 1 _glMJT-AR1vNt-eatEdyeA 3
## 2 _qCS9WIXAmaOUfubYEo_mA 5
## 3 _yJIrvPrhqRGvDplyL_cGA 2
## 4 -6pKKkWhuxoQE3Oea_6_cA 0
## 5 0_Yc6blfiI9A6Q8bKjYLSQ 5
## 6 2cTZGvOwvQGkDKZTW7JjSA 2
## 7 2YBO1LEKIgyle0uX50u15Q 7
## 8 4FEb2SzmU_l7SCQAbvW5Hg 8
## 9 7H7RNLBVg6_Z2TK6riba_g 1
## 10 aXAqPM6SlA2_asaMcCONhg 1
## # … with 61 more rows
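Note that the summary above has 71 rows rather than 100. This is presumably because inner_join() keeps only tokens that appear in the lexicon, so a business whose banner contains no scored word drops out entirely. The sketch below shows that behaviour on toy data and one possible way to keep such businesses with a score of 0. The mini-lexicon is hypothetical and its values are illustrative, not real AFINN scores.

```r
library(dplyr)

# Toy tokens for three hypothetical businesses
toy_tokens <- data.frame(
  business_id = c("biz_1", "biz_1", "biz_2", "biz_3"),
  word        = c("safe", "closed", "delivery", "safe")
)

# Hypothetical mini-lexicon; values are illustrative, not real AFINN scores
toy_lexicon <- data.frame(word = c("safe", "closed"), value = c(1, -1))

# inner_join drops biz_2, whose only token is not in the lexicon
inner_summary <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(business_id) %>%
  summarise(sentiment = sum(value))

# left_join keeps every business; unmatched words count as 0
toy_summary <- toy_tokens %>%
  left_join(toy_lexicon, by = "word") %>%
  group_by(business_id) %>%
  summarise(sentiment = sum(value, na.rm = TRUE))

nrow(inner_summary)
nrow(toy_summary)
```

Treating unmatched banners as 0 does blur the difference between "no scored words" and "positive and negative words cancel out", so which join is appropriate depends on the question being asked.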
To see the most negative and most positive businesses, I created two dataframes sorted by net sentiment, one in ascending and one in descending order.
bottom_afinn <- summary_afinn[order(summary_afinn$sentiment),]
top_afinn <- summary_afinn[order(summary_afinn$sentiment, decreasing=TRUE),]
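Beyond sorting, the net scores can be summarized as the share of businesses that come out positive, neutral, or negative. The following is a minimal sketch on made-up scores; the data frame and the polarity column are assumptions for illustration, and the numbers are not taken from the Yelp sample.

```r
library(dplyr)

# Made-up net scores standing in for summary_afinn
toy_summary <- data.frame(
  business_id = paste0("biz_", 1:8),
  sentiment   = c(5, 3, 0, -2, 7, 1, -1, 4)
)

# Label each business by the sign of its net score, then compute shares
shares <- toy_summary %>%
  mutate(polarity = case_when(
    sentiment > 0 ~ "positive",
    sentiment < 0 ~ "negative",
    TRUE          ~ "neutral"
  )) %>%
  count(polarity) %>%
  mutate(share = n / sum(n))

shares
```

Computing the shares in code rather than by eye keeps the percentages in step with whichever sample was drawn.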
nrc <- get_sentiments("nrc")
sentiment_nrc <- inner_join(banners_tokenized_2, nrc, by = "word")
summary_nrc <- count(sentiment_nrc, sentiment, sort = TRUE)
summary_nrc
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 positive 407
## 2 trust 219
## 3 anticipation 159
## 4 joy 100
## 5 negative 80
## 6 fear 38
## 7 sadness 36
## 8 surprise 21
## 9 anger 17
## 10 disgust 15
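The NRC counts may be easier to compare as a bar chart. The sketch below uses ggplot2, which is loaded above, and re-enters the counts from the printed output by hand so that it runs on its own. In the real script summary_nrc could be passed to ggplot() directly. The axis labels are my own wording.

```r
library(ggplot2)

# Counts copied from the summary_nrc output printed above
nrc_counts <- data.frame(
  sentiment = c("positive", "trust", "anticipation", "joy", "negative",
                "fear", "sadness", "surprise", "anger", "disgust"),
  n = c(407, 219, 159, 100, 80, 38, 36, 21, 17, 15)
)

# Horizontal bars ordered by count, largest at the top
nrc_plot <- ggplot(nrc_counts, aes(x = reorder(sentiment, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "NRC category", y = "Matched tokens")

nrc_plot
```

Keep in mind that a single word can belong to several NRC categories, so these counts are word-category matches rather than distinct words.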
To answer my first research question, I visualized the tokens using wordcloud2. Because the sample contained over a thousand distinct tokens, I first selected the 180 most-used ones.
banners_tokenized_top <- banners_tokenized_2 %>%
  count(word, sort = TRUE) %>%
  slice(1:180)
A quick word cloud then showed the top terms.
wordcloud2(banners_tokenized_top)
The COVID-19 pandemic disrupted, among many aspects of society, the sustainability of innumerable businesses, particularly ones that relied on face-to-face contact with their users, clients and customers. Yelp, which remains one of the most popular crowdsourcing options for business listings and customer reviews, has a unique and rich vantage point on how such businesses have responded to national, state and local COVID restrictions in order to sustain themselves. By giving businesses the option to include a special banner for COVID-related updates and announcements, Yelp has opened a window to what a business has to say about the global pandemic. I was interested in how they were saying it.
I ran a sentiment analysis with the AFINN and NRC lexicons to help shed light on the tone and the overall positivity and negativity of my sample. I used wordcloud2 to visualize the most-used unigrams across the entire sample.
My research questions were as follows:
What are the most common unigrams used in these businesses’ COVID banners?
The word cloud showed a blend of language describing public safety standards and more logistical information such as business hours and visitation methods. While the larger dataset implies that Yelp has offered some more explicit functions, such as a place to show modified hours or closings, it seems that these banners might have been a space where businesses either reiterated or overrode those other options. More could be learned about this by examining what Yelp pages looked like on the user end during the time that this data was collected.
What was the overall sentiment of these COVID banners?
The AFINN analysis showed that sentiment was predominantly positive across this sample of 100 banners. A look at the businesses sorted by AFINN score showed that 91% of the sample was positive, 3% was neutral, and 6% was negative.
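This was corroborated by the NRC analysis, which returned the following counts (they differ from the output printed earlier, presumably because the random sample was redrawn without a fixed seed):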
| Sentiment | N |
| --- | --- |
| positive | 375 |
| trust | 190 |
| anticipation | 169 |
| negative | 83 |
| joy | 78 |
| fear | 49 |
| sadness | 38 |
| surprise | 28 |
| anger | 22 |
| disgust | 22 |
It is interesting to note that after the two dominant positive categories (“positive” and “trust”), the results became a blend of positive and negative categories, with “negative” turning up 83 times among the 100 banners. Although the majority of the language was, again, positive, terms associated with fear, anticipation, and negative sentiment appeared to be lurking between and even within some of the messaging.
For the businesses with the most negative and positive sentiment, were there any noticeable distinctions in how they used the other COVID/social distancing options available through Yelp?
Unfortunately, I was not able to find the original identities of these businesses to shed more light on sentiment. Instead, I looked at the additional categories of data within this dataset to see if there were any distinct patterns in how Yelp’s COVID resources were utilized. In comparing the ten most negative banners to the ten most positive, I was not able to find any meaningful patterns in COVID resource use. Without any additional statistical modeling, it seemed equally likely that a given business would, or would not, use some of these assets, such as offering virtual services, providing a “call to action” button on their page, or offering Grubhub support. The only category that was completely homogeneous was the “temporary_closed_until” option, which every business listed as “false”. I presume this is because most businesses that provide a specific COVID banner are still in business and are interested in communicating with their customers.
This lack of correlation between COVID asset use and sentiment could be due to the wide variety of business types represented in this dataset. While I was not able to formally identify these businesses, many of the banners hint at what kinds of businesses they were. This sample possibly included restaurants, a dog groomer, a tattoo/piercing parlor, T-Mobile, a hair salon, a healthcare office, and others. These different businesses would have to adapt in very different ways to a pandemic (e.g. a hair salon wouldn’t make use of Grubhub support).
Another source of variance might be the open-endedness of the COVID banner itself. Some businesses included an impassioned letter directly to their customers that acknowledged how trying the pandemic was, while others simply used it for updates like “we also have bulk quantities of sanitizer available for $35/gallon.” Again, the business type likely influenced the kind of information being shared here, but it was entirely up to the author what they decided to use this white space for.
In future research I would like to perform a similar analysis on a larger sample but a narrower selection of businesses. A favored candidate would be restaurants, since they appear to be among the hardest hit by the pandemic and have tighter health codes to consider as well as pandemic restrictions. I would also like to run a chi-square test of association between the other COVID resource categories and sentiment.
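As a rough sketch of what that chi-square test might look like, the example below cross-tabulates a sentiment group against whether a single Yelp feature was used. All counts are hypothetical and chosen only to illustrate the mechanics, and the group and column labels are my own.

```r
# Hypothetical counts: rows are sentiment groups, columns are feature use
feature_by_group <- matrix(
  c(30, 20,
    18, 32),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    group    = c("positive_banner", "negative_banner"),
    delivery = c("used", "not_used")
  )
)

# Chi-square test of independence between sentiment group and feature use
chi_result <- chisq.test(feature_by_group)
chi_result$p.value

# With very small groups (e.g. ten vs. ten), Fisher's exact test is a
# common alternative because the chi-square approximation becomes unreliable
fisher_result <- fisher.test(feature_by_group)
fisher_result$p.value
```

One such test would be run per resource category, so with many categories some correction for multiple comparisons would be worth considering.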