The following presents an analysis of Boca Del Cielo/Cocoa Grinder’s Google Reviews from December 2015 to May 2022.
This analysis hopes to provide insight into customer sentiment over time and identify possible patterns in customer perception of the store’s services and quality of food. Two features of the data - numeric ratings and text reviews - have been thoroughly examined. The first part of this report describes trends in overall rating over various units of time - months, quarters and years. The analysis hones in on monthly ratings, but also conducts sensitivity analysis. Finally, I describes limitations of the data.
# libraries for reading in, processing and plotting data
library(readr) ; library(tidyverse) ; library(lubridate) ; library(ggplot2) ; library(data.table)
# libraries for text and sentiment analysis
library(tidytext) ; library(stopwords) ; library(textdata)
library(ggh4x) # library of the facet axes labels
library(knitr) ; library(kableExtra) # for tables
library(showtext) # for font families
font_add("Palatino", "pala.ttf")# original data downloaded
df <- read.csv("C:/Users/John Jay REC/Downloads/20220512034502ade2.csv")# look at structure of data
glimpse(df)
## Rows: 473
## Columns: 24
## $ query <chr> "ChIJe4mGUPVPwokR9OqFnqrWrxY", "Ch…
## $ name <chr> "Boca del cielo", "Boca del cielo"…
## $ google_id <chr> "0x89c24ff55086897b:0x16afd6aa9e85…
## $ place_id <chr> "ChIJe4mGUPVPwokR9OqFnqrWrxY", "Ch…
## $ location_link <chr> "https://www.google.com/maps/place…
## $ reviews_link <chr> "https://search.google.com/local/r…
## $ reviews_per_score <chr> "{\"1\": 32, \"2\": 20, \"3\": 36,…
## $ rating <dbl> 4.2, 4.2, 4.2, 4.2, 4.2, 4.2, 4.2,…
## $ review_id <chr> "ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBE…
## $ author_link <chr> "https://www.google.com/maps/contr…
## $ author_title <chr> "Patricia Medard", "Janice Lok", "…
## $ author_id <dbl> 1.17892e+20, 1.09203e+20, 1.02042e…
## $ author_image <chr> "https://lh3.googleusercontent.com…
## $ review_text <chr> "On of my favorite cafes to go to …
## $ review_img_url <chr> "https://lh5.googleusercontent.com…
## $ owner_answer <chr> "", "", "", "", "Hey steve what di…
## $ owner_answer_timestamp <int> NA, NA, NA, NA, 1644082058, NA, NA…
## $ owner_answer_timestamp_datetime_utc <chr> "", "", "", "", "2/5/2022 17:27", …
## $ review_link <chr> "https://www.google.com/maps/revie…
## $ review_rating <int> 5, 2, 5, 4, 3, 3, 5, 5, 1, 5, 5, 4…
## $ review_timestamp <int> 1651176555, 1644096998, 1641314125…
## $ review_datetime_utc <chr> "4/28/2022 20:09", "2/5/2022 21:36…
## $ review_likes <int> 3, 0, 1, 3, 1, 0, 0, 0, 1, 0, 0, 0…
## $ reviews_id <dbl> 1.63476e+18, 1.63476e+18, 1.63476e…df1 <- df %>%
# save necessary fields
select(review_id, review_text, review_rating, review_datetime_utc) %>%
# recode blanks to NA
mutate_at(vars(review_text), na_if,"")473 reviews were scraped on May 11th with the help of this site. May 10th 2022 was the last review, when the data was scraped.
Below, we prep the data further to create time units by which to aggregate the ratings.
glimpse(df1) # `review_datetime_utc` is in character format
## Rows: 473
## Columns: 4
## $ review_id <chr> "ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE", "ChZDSUhNMG…
## $ review_text <chr> "On of my favorite cafes to go to in Brooklyn. The…
## $ review_rating <int> 5, 2, 5, 4, 3, 3, 5, 5, 1, 5, 5, 4, 1, 5, 5, 4, 4,…
## $ review_datetime_utc <chr> "4/28/2022 20:09", "2/5/2022 21:36", "1/4/2022 16:…# create date and times
df1$review_date <- as.POSIXct(df1$review_datetime_utc, format = "%m/%d/%Y %H:%M")
df1$review_year <- format(df1$review_date, "%Y")
df1$review_month <- format(df1$review_date, "%B")
df1$review_week <- format(df1$review_date, "%W")
df1$review_quarter <- as.character(quarter(df1$review_date)) glimpse(df1[,c(4:9)]) # view format + values of the date-time vars created
## Rows: 473
## Columns: 6
## $ review_datetime_utc <chr> "4/28/2022 20:09", "2/5/2022 21:36", "1/4/2022 16:…
## $ review_date <dttm> 2022-04-28 20:09:00, 2022-02-05 21:36:00, 2022-01…
## $ review_year <chr> "2022", "2022", "2022", "2021", "2022", "2021", "2…
## $ review_month <chr> "April", "February", "January", "December", "Febru…
## $ review_week <chr> "17", "05", "01", "49", "05", "52", "16", "17", "5…
## $ review_quarter <chr> "2", "1", "1", "4", "1", "4", "2", "2", "4", "2", …
# year, month, week and quarter are in character formatdf1 %>%
group_by(review_year) %>%
summarise(num.reviews = n(), mean = mean(review_rating)) %>%
kbl(caption = "Average yearly ratings", digits = 2, align = "l") %>%
kable_minimal(full_width = F)| review_year | num.reviews | mean |
|---|---|---|
| 2015 | 1 | 5.00 |
| 2016 | 31 | 4.45 |
| 2017 | 86 | 4.24 |
| 2018 | 127 | 4.35 |
| 2019 | 118 | 4.12 |
| 2020 | 41 | 4.22 |
| 2021 | 53 | 4.00 |
| 2022 | 16 | 4.50 |
df1 %>% ungroup() %>%
group_by(review_year, review_quarter) %>%
summarise(num.reviews = n(), mean = mean(review_rating)) %>%
kbl(caption = "Average ratings by quarter", digits = 2, align = "l") %>%
kable_minimal(full_width = F)| review_year | review_quarter | num.reviews | mean |
|---|---|---|---|
| 2015 | 4 | 1 | 5.00 |
| 2016 | 1 | 3 | 5.00 |
| 2016 | 2 | 10 | 4.10 |
| 2016 | 3 | 9 | 4.22 |
| 2016 | 4 | 9 | 4.89 |
| 2017 | 1 | 13 | 4.38 |
| 2017 | 2 | 23 | 4.48 |
| 2017 | 3 | 16 | 4.00 |
| 2017 | 4 | 34 | 4.15 |
| 2018 | 1 | 27 | 4.22 |
| 2018 | 2 | 43 | 4.14 |
| 2018 | 3 | 34 | 4.62 |
| 2018 | 4 | 23 | 4.48 |
| 2019 | 1 | 32 | 3.88 |
| 2019 | 2 | 45 | 4.40 |
| 2019 | 3 | 24 | 4.12 |
| 2019 | 4 | 17 | 3.82 |
| 2020 | 1 | 22 | 4.27 |
| 2020 | 2 | 3 | 5.00 |
| 2020 | 3 | 7 | 4.43 |
| 2020 | 4 | 9 | 3.67 |
| 2021 | 1 | 6 | 4.83 |
| 2021 | 2 | 10 | 4.50 |
| 2021 | 3 | 16 | 3.62 |
| 2021 | 4 | 21 | 3.81 |
| 2022 | 1 | 4 | 3.50 |
| 2022 | 2 | 12 | 4.83 |
Over time, there are six quarters with average rating of less than 4. Two of those occurred in 2019 and four in the pandemic years (two of the four in 2021).
# Plot average quarterly ratings over time
ggplot(quarter,aes(x=year.quarter, y=mean, group = 1))+
geom_line(color = "#d95f02", size = 1) + # line color
theme_bw() +
theme(
panel.grid.major = element_line(colour = "#d8d8d8"), # color of grid lines in the plot area
plot.title = element_text(size = 18, face = "bold", margin=margin(0,0,20,0)),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5, size = 10),
axis.text.y = element_text(size = 10)) +
labs(x = "Year and Quarter") +
ggtitle("Average quarterly Google review ratings, 2015 to 2022") Note that if we fitted a straight line over the time period, we would have a downward sloping trend in average quarterly ratings.
# Plot average quarterly ratings over time and facet wrap by year
ggplot(quarter[quarter$review_year %in% c("2016", "2017", "2018", "2019", "2020", "2021"),],aes(x=review_quarter, y=mean, group=review_year))+
geom_line(color = "#d95f02", size = 2) + # line color
facet_wrap(~review_year, ncol = 2) +
theme_bw() +
theme(
# color of grid lines in the plot area
panel.grid.major = element_line(colour = "#d8d8d8"),
plot.title = element_text(size = 16, face = "bold", margin=margin(0,0,20,0)),
# size of individual facet labels
strip.text.x = element_text(size = 12),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5, size = 10), axis.text.y = element_text(size = 10)) +
ggtitle("Average quarterly Google review ratings, 2016 to 2021") +
labs(x = "Quarter") +
# x axis labels on all facets
facet_wrap2(vars(review_year), ncol = 2, axes = "all")It is interesting to note that 2019 and 2020 have a somewhat similar trajectory, but likely for very different reasons. 2020 was when the pandemic hit.
# weekly count of reviews & average weekly ratings
weekly <- df1 %>% group_by(review_year, review_week) %>% summarise(num.reviews = n(), mean = mean(review_rating)) %>% ungroup()
#225 obs
# number of weeks between first and last review
difftime(max(df1$review_date), min(df1$review_date), units = "weeks") #333 weeks
## Time difference of 332.9712 weeks
# percent of weeks with at least one review
round((nrow(weekly)/as.numeric(difftime(max(df1$review_date), min(df1$review_date), units = "weeks")))*100,2)
## [1] 67.57At least two third or 225 of the 333 weeks between the first and last review have at least one review rating.
Next we plot the number of reviews received, by week:
# create year-quarter label for x-axis
weekly$year.week <- paste0(weekly$review_year, " w", weekly$review_week)
# plot weekly counts over time
ggplot(weekly, aes(x=year.week, y=num.reviews)) +
geom_line(color = "#d95f02", size = 1, group = 1) +
ggtitle("Weekly count of reviews, Dec 2015 to May 2022") +
theme(axis.title.y = element_blank(), axis.title.x = element_blank(), axis.text.x = element_blank())From the plot above, there is a clear outlier. 2018 week 14 has 22 reviews. Taking a look at the review text might account for the presence of this outlier.
df1 %>% filter(review_week == "14" & review_year == "2018" & !(is.na(review_text))) %>% select(review_text)## review_text
## 1 I really like the food at Cocoa Grinder. However, when I visited on a Tuesday morning, I noticed there was just one person cooking everything. I had been waiting about 15 minutes after my order. The cook looked very put out / stressed when I asked if the dish that I saw was ready was in fact mine (as he walked by). He showed me the receipt of my order and asked if it was mine. I apologized for interrupting him and for potentially stressing him out and confirmed it was mine. He continued to look put out and silently made my dish (along with a few others simultaneously). I understand that cooking under pressure for a lot of people is stressful. What seems to be an under-staffing / training issue could be easily remedied.
## 2 Great place for lunch. Looks very healthy and natural. Highly recommended.
## 3 Amazing food amazing staff <3 <3 I ALWAYS come back here, and highly recommend to everyone !!!!!!!
## 4 They had a sign outside that made me cry...it was so sweet and I can’t even begin to explain how much it meant to me and I’m sure everyone else, like me, who saw it.
## 5 Nice place but the coffee reminds me of gas station coffee
## 6 the food and coffee was horrible.\nChange your religion to drink coffee
## 7 Thank you for your great Sign:)
## 8 One excellent place I have been there
## 9 the food and coffee was horrible.
## 10 Thank you for your kindness
## 11 Horrible place ..sucks
## 12 The bast 😊
An Instagram post verifies the spike in number of reviews for this week in 2018. A sign pasted on the store front attracted community attention and generated positive sentiment: https://www.instagram.com/p/BhIXNHjFK-l/
Next, we examine the frequency of poor ratings on a weekly basis, over time.
# number of weeks with subpar average rating
weekly %>% ungroup() %>% filter(mean < 4) %>% count(review_year) %>%
kbl(caption = "Weeks with avg. rating 3 and below", digits = 2, align = "l") %>%
kable_minimal(full_width = F)| review_year | n |
|---|---|
| 2016 | 3 |
| 2017 | 8 |
| 2018 | 6 |
| 2019 | 11 |
| 2020 | 6 |
| 2021 | 8 |
| 2022 | 1 |
2019 seems to have the most number of weeks (11) with poor ratings. Below, are the weeks listed. There seems to be quite an even dispersion throughout the year as to when the store received poor ratings. Cocoa Grinder received 1,2,3 star ratings on the following weeks:
weekly %>% filter(review_year == "2019" & mean < 4) %>% select(review_week)
## # A tibble: 11 × 1
## review_week
## <chr>
## 1 00
## 2 02
## 3 07
## 4 15
## 5 20
## 6 31
## 7 38
## 8 39
## 9 45
## 10 48
## 11 51For the final portion of the descriptive analysis, we focus our attention on monthly reviews.
# count of reviews by year & month
count.yr.mth <- df1 %>% group_by(review_year, review_month) %>% summarise(num.reviews = n())
# count months with at least one review
count.yr.mth %>%
ungroup %>%
group_by(review_year) %>%
summarise(n()) %>%
kbl(caption = "Months with at least one review", digits = 2, align = "l") %>%
kable_minimal(full_width = F)| review_year | n() |
|---|---|
| 2015 | 1 |
| 2016 | 10 |
| 2017 | 12 |
| 2018 | 12 |
| 2019 | 12 |
| 2020 | 10 |
| 2021 | 11 |
| 2022 | 4 |
Every year from 2016 to 2021 had at least 10 months with reviews.
# counts and lists the popular months for reviews from 2015 to 2021
df1 %>% ungroup() %>%
group_by(review_year, review_month) %>%
summarise(n=n()) %>%
ungroup() %>%
slice_max(n, n=20) %>%
count(review_month) %>%
kbl(caption = "Months with at least one review", digits = 2, align = "l") %>%
kable_minimal(full_width = F)| review_month | n |
|---|---|
| April | 4 |
| August | 2 |
| December | 3 |
| February | 2 |
| January | 3 |
| July | 4 |
| June | 1 |
| March | 2 |
| May | 2 |
| November | 1 |
| October | 2 |
| September | 2 |
In the top 20 months (from December 2015 to May 2022) with the most reviews, April is mentioned 4 times. April seems to be the most common month to submit reviews and understandably so because the weather is improving and generally putting more people in a mood to write about their dining experiences.
Let us now look at plots of monthly data. In order to plot monthly data, we first need to create a month-year placeholder data frame of time units from 2015 to 2022, and join it to the Google review data.
Data prep
# create year-month placeholder
yr.mth.df <- tibble(date = seq(as.Date("2015-12-01"), as.Date("2022-05-31"), by = "1 month")) #78 obs
# create additional time variables
yr.mth.df <- yr.mth.df %>%
mutate(year = format(date, "%Y"),
quarter = quarter(date),
month = format(date, "%B"),
month_n = format(date, "%m"))
# join the monthly counts of reviews to placeholder
count.yr.mth <- left_join(yr.mth.df, count.yr.mth, by = c("year" = "review_year", "month" = "review_month"))
# replace NA with 0 (ie months with no review)
count.yr.mth <- count.yr.mth %>% mutate_at(vars(num.reviews), ~replace_na(., 0))glimpse(count.yr.mth) # data preview
## Rows: 78
## Columns: 6
## $ date <date> 2015-12-01, 2016-01-01, 2016-02-01, 2016-03-01, 2016-04-0…
## $ year <chr> "2015", "2016", "2016", "2016", "2016", "2016", "2016", "2…
## $ quarter <int> 4, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 1, 2, 2, 2, 3…
## $ month <chr> "December", "January", "February", "March", "April", "May"…
## $ month_n <chr> "12", "01", "02", "03", "04", "05", "06", "07", "08", "09"…
## $ num.reviews <int> 1, 1, 2, 0, 4, 3, 3, 3, 6, 0, 5, 1, 3, 3, 5, 5, 10, 6, 7, …# re-format month for ggplotting
count.yr.mth$month <- lubridate::month(count.yr.mth$date, label=TRUE)
# ggplot go brrr
ggplot(count.yr.mth[count.yr.mth$year %in% c("2016", "2017", "2018", "2019", "2020", "2021"),],aes(x=month, y=num.reviews, group=year))+
geom_line(color = "#d95f02", size = 1.5) + # line color
facet_wrap(~year, ncol = 2) +
theme_bw() +
theme(
panel.grid.major = element_line(colour = "#d8d8d8"), # color of grid lines in the plot area
plot.title = element_text(size = 16, face = "bold", margin=margin(0,0,20,0)),
strip.text.x = element_text(size = 12),
axis.title.y = element_blank(), axis.title.x = element_blank(),
axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 10),
axis.text.y = element_text(size = 12)) +
ggtitle("Monthly count of Google reviews, 2016 to 2021") +
labs(x = "") +
# x axis labels on all facets
facet_wrap2(vars(year), ncol = 2, axes = "all")There are three years where no month in that year received more than ten reviews: 2016 (first year of operation), 2020 and 2021 (the pandemic years). Between 2017 and 2019, there are approximately an average of two to three months within the year that receive more than ten reviews.
# create df of average monthly ratings
yr.month.avg <- df1 %>% ungroup() %>%
group_by(review_year, review_month) %>%
summarise(reviews = n(), mean = mean(review_rating)) %>%
#create year_month concatenated and format as date
mutate(year_month = paste(review_year,review_month,sep = "m"),
date = ym(year_month))
summary(yr.month.avg$mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.402 4.305 4.733 5.000
# histogram
hist(yr.month.avg$mean, main = "Distribution of average monthly ratings", xlab = "Monthly average")# Plot average monthly ratings over time
ggplot() +
geom_line(data = yr.month.avg, aes(x = date, y = mean)) +
labs(x = "Year", y = "Mean Monthly Rating") +
ggtitle("Monthly ratings over time")We can create a more informative plot of average monthly ratings broken down by year
# join to placeholder the df that contains both counts of reviews and average review rating per month ...
yr.month.avg <- left_join(yr.mth.df, yr.month.avg, by = c("date", "year" = "review_year", "month" = "review_month"))
# pre-plotting revision to the data
yr.month.avg <- yr.month.avg %>%
# replace NA with 0 (ie rating for months with no review)
mutate_at(vars(reviews, mean), ~replace_na(., 0)) %>%
# change format of month for ggplotting
mutate(month = lubridate::month(date, label=TRUE))# plot average monthly Google review rating and facet wrap by year
ggplot(yr.month.avg[!(yr.month.avg$year %in% c("2015")),],aes(x=month, y=mean, group=year))+
geom_line(color = "#d95f02", size = 2) +
facet_wrap(~year, ncol = 2) +
theme_bw() +
ggtitle("Average monthly rating, 2016 to 2022") +
theme(panel.grid.major = element_line(colour = "#d8d8d8"),
# plot title font style
plot.title = element_text(size = 18, face = "bold", margin=margin(0,0,20,0)),
# facet titles font size
strip.text.x = element_text(size = 14),
# blank axes titles
axis.title.y = element_blank(), axis.title.x = element_blank(),
# adjusting x and y axis markers font style
axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.5, size = 10), axis.text.y = element_text(size = 10)) +
# arranges the facets by year, into two cols
facet_wrap2(vars(year), axes = "all", ncol = 2) There seems to be no consistent trend in average monthly ratings across the years. There are two months in 2016 - Cocoa Grinder’s first year of operations - with an average rating of 0. These are the two months with zero reviews as seen in the table earlier. The months in 2017, 2018 and 2019 do not fluctuate that much in terms of average monthly ratings.
Note that in some plots, 2015 & 2022 data has been removed because they are not full years and would distract by creating too many plots
# Plot stacked bar chart and facet-wrap by year
df2 <- df2 %>% filter(!(review_year %in% c("2015", "2022")))
#
ggplot(df2[!(df2$review_year %in% c("2015", "2022")),]) +
aes(x=review_month,fill=as.character(review_rating)) +
geom_bar(position="fill", width = 0.6) +
scale_fill_manual(values = c("#a6611a", "#dfc27d", "#6b6b6b", "#80cdc1", "#018571")) +
coord_flip() +
theme(
# color of plot area lines
panel.grid.major = element_line(colour = "#d8d8d8"),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, face = "italic", margin=margin(0,0,20,0)),
strip.text.x = element_text(size = 14), # increase size of facet labels
axis.title.y = element_blank(), axis.title.x = element_blank(),
axis.text.x = element_blank(), axis.text.y = element_text(size = 10),
legend.position = "bottom", legend.key.size = unit(0.75, 'cm'), legend.direction = "horizontal",
legend.title = element_text(size=10), legend.text = element_text(size=10)) +
facet_wrap(~review_year) +
# facet_grid(~review_year, scales = "free_x", space = "free_x") + #this did not do anything to increase space
labs(fill = "Review Rating") +
ggtitle("Proportion of reviews by rating for each month, 2016 to 2022",
subtitle = "E.g: In Dec 2017, there were 33% 1-star, 33% 4-star and 33% 5-star reviews") +
facet_wrap2(vars(review_year), axes = "all") There are several basic steps to quantifying the sentiment of each review. We essentially identify words within a review that have positive or negative sentiment and then count them. Each review is assigned a sentiment value after we calculate the number of net positive or negative words. Reviews with a sentiment value score of more than zero are on the whole positive (ie they contain more positive words than negative words) and reviews with a negative score have overall negative sentiment.
205 or 43 percent of reviews do not have accompanying text while 268 or 57 percent are text reviews
# Percentage of reviews with text for each rating level
df1 %>% group_by(review_rating) %>%
summarise(
reviews = n(),
reviews_no_text = sum(is.na(review_text)),
reviews_with_text = reviews - reviews_no_text,
pct = (reviews_with_text/reviews)*100) %>%
kbl(caption = "Percentage of reviews with text for each rating level ", digits = 2, align = "l") %>%
kable_minimal(full_width = F)| review_rating | reviews | reviews_no_text | reviews_with_text | pct |
|---|---|---|---|---|
| 1 | 32 | 7 | 25 | 78.12 |
| 2 | 20 | 3 | 17 | 85.00 |
| 3 | 36 | 17 | 19 | 52.78 |
| 4 | 102 | 44 | 58 | 56.86 |
| 5 | 283 | 134 | 149 | 52.65 |
There are 52 one and two star ratings over time. Notice that a higher proportion of one and two star reviews are accompanied by text compared to reviews with ratings of 3 and above. 42 of the 52 (or 81 %) one and two star reviews contain text compared to 54 % of three, four and five star reviews.
Unnest tokens
The next step in sentiment analysis is to unnest tokens which means unspooling our corpus of text into one word per row. After unnesting tokens, we end up with a data frame where each row represents every word in a review.
# restructure data
df2 <- unnest_tokens(tbl=df1, input=review_text, output = word) # 7895 obs (words)
df2 %>% head() %>% kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_id | review_rating | review_datetime_utc | review_date | review_year | review_month | review_week | review_quarter | word |
|---|---|---|---|---|---|---|---|---|
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | on |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | of |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | my |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | favorite |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | cafes |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | to |
Given that there are 7895 words in the corpus of Google review text, our next task will be the remove stop words from our text corpus.
Word count
# histogram of review-level word count
df2 %>% ungroup() %>%
group_by(review_id) %>%
summarise(word.count = n()) %>%
ungroup() %>%
ggplot(aes(word.count)) +
geom_histogram() + ggtitle("Distribution of review-level word count")# breakdown of number of reviews with more than 100 words by rating score
df2 %>%
group_by(review_rating, review_id) %>%
summarise(word.count = n()) %>%
filter(word.count >= 100) %>%
count(review_rating)## # A tibble: 5 × 2
## # Groups: review_rating [5]
## review_rating n
## <int> <int>
## 1 1 5
## 2 2 2
## 3 3 3
## 4 4 5
## 5 5 1
It is interesting that there are five one-star and five four-star reviews with more than a hundred words. It seems that reviews which have a lot to say about the cafe could be either lowly or highly rated.
Remove stop words
Stop words are commonly used words in a given language. Examples of stop words in English are “a”, “the”, “is”, “are” and etc. We need to eliminate these words from the corpus of text we are analyzing because they do not indicate sentiment and are of no use to us for this analysis.
Before the anti_join we had a data frame with 7895
words, which was the total number of words in our corpus. After removing
the stop words, we are left with 3868 words or rows.
df3 %>% head() %>% kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_id | review_rating | review_datetime_utc | review_date | review_year | review_month | review_week | review_quarter | word |
|---|---|---|---|---|---|---|---|---|
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | favorite |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | cafes |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | brooklyn |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | staff |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | super |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | kind |
Word frequency
Next we shall take a look at the top three most commonly used word for each rating level
# top 3 common words
df3 %>%
filter(!(is.na(word))) %>% # filter out the NA's
group_by(review_rating) %>%
count(word) %>%
arrange(review_rating, desc(n)) %>%
slice_max(n, n=3) %>%
kbl(caption = "Top 3 words used, by rating level", align = "l") %>% kable_minimal(full_width = F)| review_rating | word | n |
|---|---|---|
| 1 | coffee | 15 |
| 1 | food | 8 |
| 1 | order | 8 |
| 2 | food | 11 |
| 2 | service | 6 |
| 2 | milk | 5 |
| 3 | food | 12 |
| 3 | place | 12 |
| 3 | good | 7 |
| 4 | good | 43 |
| 4 | food | 33 |
| 4 | place | 29 |
| 5 | place | 69 |
| 5 | great | 68 |
| 5 | food | 63 |
Even before identifying the sentiment contained in each review, we can get some sense of why someone might have given a poor ratings. Notice the repetitiveness of the top three words for reviews with ratings 3 and above: “good” or “great”, “place” and “food”. In contrast, poor ratings or one or two star reviews mention the most fundamental aspects of dining out like cleanliness & service manners etc.
Join to sentiment
Next, we join the remaining corpus of text to a list of words with labelled sentiment.
# import list of words with sentiment
bing <- get_sentiments(lexicon = "bing") # 6786 obs (words)
df4 <- inner_join(df3, bing, by="word") # 898 obs (words)In the resulting data frame, one row contains each positive or negative word per review
df4 %>% head() %>% kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_id | review_rating | review_datetime_utc | review_date | review_year | review_month | review_week | review_quarter | word | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | favorite | positive |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | super | positive |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | strong | positive |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | crazy | negative |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | inexpensive | positive |
| ChZDSUhNMG9nS0VJQ0FnSUMydTdTN1RBEAE | 5 | 4/28/2022 20:09 | 2022-04-28 20:09:00 | 2022 | April | 17 | 2 | great | positive |
df4 %>% group_by(sentiment) %>% summarise(n_distinct(review_id)) %>%
kbl(caption = "Reviews containing either pos. or neg. sentiment", align = "l") %>%
kable_minimal(full_width = F)| sentiment | n_distinct(review_id) |
|---|---|
| negative | 100 |
| positive | 242 |
There are 100 reviews with negative sentiment and 242 reviews with positive sensitive. Note that these do not sum to 260 reviews because some reviews contain both positive and negative sentiment. In the next step we will calculate the net sentiment per review.
Quantifying sentiment
Finally, we count up the instances a positive or negative sentiment is identified in a review.
This data frame sums the number of instances a positive or negative sentiment word is identified in each review. Some reviews contain only positive or only negative sentiment words while some reviews contain both sentiment.
df5 <- df4 %>% ungroup() %>% group_by(review_id,sentiment) %>% summarise(count=n()) #342 obs
df5 %>% head() %>% kbl(caption = "Data preview of pos. and neg. sentiment per review", align = "l") %>% kable_minimal(full_width = F)| review_id | sentiment | count |
|---|---|---|
| ChdDSUhNMG9nS0VJQ0FnSUM0dWNtVnFRRRAB | positive | 1 |
| ChdDSUhNMG9nS0VJQ0FnSUM4azlTcTRBRRAB | positive | 2 |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | negative | 5 |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | positive | 2 |
| ChdDSUhNMG9nS0VJQ0FnSUMwb016cXFRRRAB | positive | 1 |
| ChdDSUhNMG9nS0VJQ0FnSUMwcTdlbDVnRRAB | negative | 2 |
Reshape to review-level
Finally, we will reshape the data so that one row represents one review
#
df6 <- df5 %>%
# reshape so that one row represents one review (NA values here are assigned 0)
pivot_wider(names_from = sentiment, values_fr = count, values_fill = 0) %>%
# calculate net sentiment
mutate(sentiment.score = positive - negative)
df6 %>% head() %>% kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_id | positive | negative | sentiment.score |
|---|---|---|---|
| ChdDSUhNMG9nS0VJQ0FnSUM0dWNtVnFRRRAB | 1 | 0 | 1 |
| ChdDSUhNMG9nS0VJQ0FnSUM4azlTcTRBRRAB | 2 | 0 | 2 |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 |
| ChdDSUhNMG9nS0VJQ0FnSUMwb016cXFRRRAB | 1 | 0 | 1 |
| ChdDSUhNMG9nS0VJQ0FnSUMwcTdlbDVnRRAB | 3 | 2 | 1 |
| ChdDSUhNMG9nS0VJQ0FnSUMwLXJmUWlBRRAB | 2 | 4 | -2 |
Example
Here is an example of a review that received an overall sentiment score of -2.
df6 %>% filter(review_id == "ChdDSUhNMG9nS0VJQ0FnSUN3LUxMbTR3RRAB")
## # A tibble: 1 × 4
## # Groups: review_id [1]
## review_id positive negative sentiment.score
## <chr> <int> <int> <int>
## 1 ChdDSUhNMG9nS0VJQ0FnSUN3LUxMbTR3RRAB 3 5 -2
df4 %>% filter(review_id == "ChdDSUhNMG9nS0VJQ0FnSUN3LUxMbTR3RRAB") %>% select(word, sentiment)
## word sentiment
## 1 sucks negative
## 2 rude negative
## 3 work positive
## 4 loud negative
## 5 annoying negative
## 6 pretty positive
## 7 delicious positive
## 8 negative negative
df1 %>% filter(review_id == "ChdDSUhNMG9nS0VJQ0FnSUN3LUxMbTR3RRAB") %>% select(review_text)
## review_text
## 1 First off, the service here sucks. Employees are rude and it is evident they don't want to be there. Both employees today had their kids at work and they were very loud and annoying to all the customers.They never update the menu when they no longer serve something; i had to order 3 things before she finally said they served that item. I will say that the food and drinks I have gotten here are pretty delicious, but the overall experience here was negative and the food alone couldn't make up for that.Here we examine the breakdown of sentiment scores and the number of reviews that attained a given score. The sentiment score ranged from -3 to 20.
df6 %>%
pull(sentiment.score) %>%
table(., useNA = "always") %>%
kbl(caption = "Sentiment score", align = "l") %>% kable_minimal(full_width = F)| . | Freq |
|---|---|
| -3 | 5 |
| -2 | 9 |
| -1 | 17 |
| 0 | 21 |
| 1 | 69 |
| 2 | 49 |
| 3 | 37 |
| 4 | 21 |
| 5 | 9 |
| 6 | 10 |
| 7 | 6 |
| 8 | 2 |
| 9 | 3 |
| 18 | 1 |
| 20 | 1 |
| NA | 0 |
When looking at reviews in the aggregate, it seems that there is overwhelmingly positive sentiment.
It is worth noting a limitation here: For instance, Bing allocates
sentiments to words like “impossible” (negative) and “seamless”
(positive). The former refers to the name of the burger on the store’s
menu while “seamless” is the food ordering service. Additionally, bing
did not allocate a sentiment to “rudeness” (“rudeness” is present in the
word col of df3 but it is not present in
df4 which contains words with labelled sentiment). We can
resolve this by customizing the list of words with labelled sentiment,
either removing or adding words specific to the corpus of text being
analysed.
Join sentiment score to word-level data frame
# join to `df4` where each row is a word assigned either positive or negative within a review
df.words <- left_join(df6, df4, by = "review_id")
df.words %>% head() %>% kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_id | positive | negative | sentiment.score | review_rating | review_datetime_utc | review_date | review_year | review_month | review_week | review_quarter | word | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChdDSUhNMG9nS0VJQ0FnSUM0dWNtVnFRRRAB | 1 | 0 | 1 | 5 | 6/9/2019 21:07 | 2019-06-09 21:07:00 | 2019 | June | 22 | 2 | delicious | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4azlTcTRBRRAB | 2 | 0 | 2 | 5 | 7/12/2020 22:32 | 2020-07-12 22:32:00 | 2020 | July | 27 | 3 | nice | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4azlTcTRBRRAB | 2 | 0 | 2 | 5 | 7/12/2020 22:32 | 2020-07-12 22:32:00 | 2020 | July | 27 | 3 | recommend | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 | 1 | 7/7/2020 19:19 | 2020-07-07 19:19:00 | 2020 | July | 27 | 3 | impossible | negative |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 | 1 | 7/7/2020 19:19 | 2020-07-07 19:19:00 | 2020 | July | 27 | 3 | sweet | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 | 1 | 7/7/2020 19:19 | 2020-07-07 19:19:00 | 2020 | July | 27 | 3 | cold | negative |
# create a row index by incident_id
df.words <- df.words %>%
group_by(review_id) %>%
mutate(row.index = row_number()) %>%
relocate(review_id, row.index)Frequency of negative-sentiment words in one, two and three star reviews
df.words %>%
ungroup() %>%
filter(review_rating %in% c(1,2,3) & sentiment == "negative") %>%
count(word) %>%
arrange(desc(n)) %>%
head(.,10) %>%
kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| word | n |
|---|---|
| horrible | 7 |
| rude | 6 |
| bad | 5 |
| worst | 5 |
| cold | 4 |
| slow | 4 |
| cheap | 3 |
| expensive | 3 |
| overpriced | 3 |
| sucks | 3 |
Top three negative-sentiment words for each level of review rating
df.words %>%
ungroup() %>%
filter(sentiment == "negative") %>%
group_by(review_rating) %>%
count(word) %>%
slice_max(n=3, order_by = n) %>%
kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_rating | word | n |
|---|---|---|
| 1 | horrible | 6 |
| 1 | rude | 5 |
| 1 | worst | 5 |
| 2 | slow | 3 |
| 2 | dirty | 2 |
| 2 | lacks | 2 |
| 2 | sucks | 2 |
| 3 | bad | 4 |
| 3 | cheap | 3 |
| 3 | expensive | 3 |
| 4 | chill | 4 |
| 4 | cold | 3 |
| 4 | expensive | 2 |
| 4 | overpriced | 2 |
| 4 | pricey | 2 |
| 5 | chill | 6 |
| 5 | hang | 5 |
| 5 | cold | 4 |
One or two star reviews have gripes with the most fundamental aspects of the store like cleanliness & service manners etc.
df.words <- left_join(df6, df4, by = "review_id")
df.words %>% head() %>% kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_id | positive | negative | sentiment.score | review_rating | review_datetime_utc | review_date | review_year | review_month | review_week | review_quarter | word | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChdDSUhNMG9nS0VJQ0FnSUM0dWNtVnFRRRAB | 1 | 0 | 1 | 5 | 6/9/2019 21:07 | 2019-06-09 21:07:00 | 2019 | June | 22 | 2 | delicious | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4azlTcTRBRRAB | 2 | 0 | 2 | 5 | 7/12/2020 22:32 | 2020-07-12 22:32:00 | 2020 | July | 27 | 3 | nice | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4azlTcTRBRRAB | 2 | 0 | 2 | 5 | 7/12/2020 22:32 | 2020-07-12 22:32:00 | 2020 | July | 27 | 3 | recommend | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 | 1 | 7/7/2020 19:19 | 2020-07-07 19:19:00 | 2020 | July | 27 | 3 | impossible | negative |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 | 1 | 7/7/2020 19:19 | 2020-07-07 19:19:00 | 2020 | July | 27 | 3 | sweet | positive |
| ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB | 2 | 5 | -3 | 1 | 7/7/2020 19:19 | 2020-07-07 19:19:00 | 2020 | July | 27 | 3 | cold | negative |
Top 3 positive sentiment words by review rating
df.words %>% ungroup() %>%
filter(sentiment == "positive") %>%
group_by(review_rating) %>%
count(word) %>%
slice_max(n=3, order_by = n) %>%
kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_rating | word | n |
|---|---|---|
| 1 | refund | 4 |
| 1 | friendly | 3 |
| 1 | fine | 2 |
| 1 | fresh | 2 |
| 1 | nice | 2 |
| 1 | recommend | 2 |
| 1 | super | 2 |
| 1 | sweet | 2 |
| 1 | work | 2 |
| 2 | good | 4 |
| 2 | nice | 4 |
| 2 | hot | 2 |
| 2 | recommend | 2 |
| 2 | work | 2 |
| 3 | good | 7 |
| 3 | nice | 6 |
| 3 | great | 5 |
| 4 | good | 43 |
| 4 | nice | 23 |
| 4 | great | 21 |
| 5 | great | 68 |
| 5 | good | 36 |
| 5 | love | 27 |
Common positive-sentiment words in low-rating reviews are words like “refund” and “work”
In contrast, positive words that appear most frequently in three, four and five star reviews are generic words like “good”, “great” which are reflective of an experience at BDC that was overall positive, whereas negative reviews tend to be motivated by and describe a specific experience the reviewer had at the cafe.
Overall imitations: * There is not a very large corpus of text here to parse. * Positive reviews by are large do not mention a lot and negative (one to two star) reviews tend to be emotion-driven.
df.words %>% ungroup() %>%
filter(sentiment == "positive") %>%
group_by(review_rating) %>%
count(word) %>%
slice_max(n=5, order_by = n) %>%
kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_rating | word | n |
|---|---|---|
| 1 | refund | 4 |
| 1 | friendly | 3 |
| 1 | fine | 2 |
| 1 | fresh | 2 |
| 1 | nice | 2 |
| 1 | recommend | 2 |
| 1 | super | 2 |
| 1 | sweet | 2 |
| 1 | work | 2 |
| 2 | good | 4 |
| 2 | nice | 4 |
| 2 | hot | 2 |
| 2 | recommend | 2 |
| 2 | work | 2 |
| 3 | good | 7 |
| 3 | nice | 6 |
| 3 | great | 5 |
| 3 | decent | 3 |
| 3 | pretty | 3 |
| 4 | good | 43 |
| 4 | nice | 23 |
| 4 | great | 21 |
| 4 | delicious | 8 |
| 4 | healthy | 8 |
| 5 | great | 68 |
| 5 | good | 36 |
| 5 | love | 27 |
| 5 | delicious | 24 |
| 5 | friendly | 23 |
| 5 | nice | 23 |
# to get the date join df6 to df1
df6 <- df6 %>% ungroup() ; df1 <- df1 %>% ungroup()
df6a <- left_join(df6, df1, by = "review_id")Using sentiment score (net sentiment), we count the number of positive, negative and neutral reviews
df6a <- df6a %>% mutate(sentiment.value =
case_when(sentiment.score > 0 ~ "positive",
sentiment.score == 0 ~ "neutral",
sentiment.score < 0 ~ "negative"))
n_distinct(df6$review_id) # 260 reviews ## [1] 260
table(df6a$sentiment.value, useNA = "always") ##
## negative neutral positive <NA>
## 31 21 208 0
A majority of reviews, 208 out of 260, have net-positive sentiment
Next we plot the sentiment score over time, on a monthly basis
# Create data frame with monthly frequency of positive, negative and neutral reviews
freq.sent.value <- df6a %>%
group_by(review_year, review_month, sentiment.value) %>%
count() %>%
pivot_wider(names_from = sentiment.value, values_from = n, values_fill = 0)freq.sent.value %>%
rowwise() %>%
filter(negative > positive) %>% # 2020 oct, 2021 july & 2021 nov
kbl(caption = "", align = "l") %>% kable_minimal(full_width = F)| review_year | review_month | neutral | positive | negative |
|---|---|---|---|---|
| 2020 | October | 0 | 0 | 1 |
| 2021 | July | 0 | 2 | 4 |
| 2021 | November | 0 | 1 | 2 |
Months with more negative reviews than positive reviews are all pandemic months
Plot of sentiment score over time
ggplot(df6a, aes(x=as.Date(review_date))) +
geom_col(aes(y = sentiment.score, fill = sentiment.value, color = sentiment.value), size = 1.1) +
# specify colors for three sentiment values
scale_colour_manual(values = c("lightcoral", "black", "lightsteelblue3")) +
labs(x = NULL,y = NULL, title="Sentiment scores from Dec 2015 to May 2022") +
# draw a horizontal line separating good reviews from bad
geom_hline(yintercept = 0, color = "black") +
# adjust the legend
theme(legend.position = "top", legend.key.size = unit(0.5, 'cm'), legend.direction = "horizontal",
legend.title = element_blank(), legend.text = element_text(size=12),
legend.key = element_rect(fill = "white")) +
# remove the grey fill in the legend key
guides(color = guide_legend(override.aes = list(fill = c("lightcoral", "black", "lightsteelblue3")))) +
# customize the x axis time breaks
scale_x_date(date_breaks = "6 months", date_labels = "%Y-%b")Correlation between rating and sentiment score
round(cor(df6a$review_rating, df6a$sentiment.score), 2) # 0.43
## [1] 0.43
# round(((cor.test(df6a$review_rating, df6a$sentiment.score))$p.value), 10)
cor.test(df6a$review_rating, df6a$sentiment.score)$p.value < 0.05 #TRUE
## [1] TRUE# GENERAL LIMITATIONS
# Limitations: bing did not allocate a sentiment to "rudeness" (filter `word` in `df3` and "rudeness" is there but it's not in `df4`)
# df1 %>% filter(review_id == "ChdDSUhNMG9nS0VJQ0FnSUM4M3ItdWtRRRAB") %>% select(review_text)
# Limitations: ALSO, WHERE IS THE OTHER POSITIVE SENTIMENT WORD IN THIS REVIEW BESIDES SEAMLESS
# INCLUDE SOME EXAMPLES HERE -- maybe look up the reviews that mention these words idk
# WORD COUNT BY RATING
# done - WHAT ABOUT WORD COUNT BY SENTIMENT
# done - Possible qns: for 1 & 2 star reviews, summarise the negative sentiment expressed to identify what was the bad review ABOUT
# Theme collections of words: e.g. food (omelette, coffee, bagel)