The purpose of this analysis

This analysis compares the determinants of the guest experience at the five top-rated luxury restaurants in Bucharest, Romania, before and during the COVID-19 pandemic. In particular, 1048 reviews were posted before the pandemic outbreak and 58 reviews were posted during the pandemic. The data was collected with web scraping techniques from the TripAdvisor pages of the following restaurants: Kaiamo, Casa di David, L’Atelier, The Artist and Relais & Chateaux Le Bistrot Francais.

Exploratory Data Analysis

1. Loading the relevant packages

For this analysis we used the tidyverse together with several packages for text mining.

# Core data manipulation and visualization
library(tidyverse)   # attaches dplyr, readr, ggplot2, stringr, tidyr, purrr, forcats
library(lubridate)   # date handling
library(scales)      # axis-label formatting
library(broom)       # tidy model output
library(gt)          # display tables

# Text mining
library(tidytext)
library(textdata)    # sentiment lexicons such as AFINN
library(SnowballC)   # word stemming
library(widyr)       # pairwise counts and correlations
library(quanteda)
library(quanteda.textplots)

# Word networks and wordclouds
library(igraph)
library(ggraph)
library(wordcloud)
library(reshape2)    # acast() for the comparison cloud

2. Reading the data

The reviews were scraped from TripAdvisor. The scraping script was last run in April 2021, so new reviews may have been added to the website since then.
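
For reference, here is a minimal sketch of what such scraping could look like with rvest. This is not the actual collection script: the URL and the CSS selectors below are hypothetical placeholders, and TripAdvisor’s markup changes over time.

# Hypothetical scraping sketch (illustration only; URL and selectors are placeholders)
library(rvest)
library(tibble)

page  <- read_html("https://www.tripadvisor.com/Restaurant_Review-example")  # hypothetical URL
nodes <- html_elements(page, ".review-container")                            # hypothetical selector

scraped <- tibble(
  data       = html_text2(html_element(nodes, ".ratingDate")),     # hypothetical selector
  review     = html_text2(html_element(nodes, ".partial_entry")),  # hypothetical selector
  restaurant = "Kaiamo"
)

In practice one would loop over each restaurant’s review pages and bind the results into a single data frame like the CSV used below.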

df <- read_csv("Top5_luxury_restaurant_Bucharest.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   data = col_character(),
##   review = col_character(),
##   restaurant = col_character()
## )

3. Checking for complete cases and converting the date variable from character to date

The csv file contains 3 string columns: data (the review date), review and restaurant. The data column was converted from string to date for further analysis.

# Keep only complete cases
df <- df[complete.cases(df), ]
# Parse dates such as "March 8 2021"
df$data <- as.Date(df$data, format = "%B %d %Y")
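
A caveat: the "%B" format matches month names in the session’s locale, so this parse can fail on non-English systems. Since lubridate is already loaded, an alternative sketch to the as.Date() call above (assuming the raw strings look like "March 8 2021", which is what the format string implies) is:

# Parse "month day year" strings; returns NA with a warning for
# anything that does not match
df$data <- mdy(df$data)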

4. Some statistics

The data contains 1106 observations and 3 columns. The earliest review is from 19 May 2010 and the latest from 8 March 2021.

# checking the dimension of the data
dim(df)
## [1] 1106    3
# earliest review
min(df$data) 
## [1] "2010-05-19"
#latest review
max(df$data)
## [1] "2021-03-08"

5. Arrange the data

We arranged the data from the latest review to the oldest to get a better overview of the data.

# Arrange the data from latest review to oldest
df <- df %>%
  arrange(desc(data))
head(df)
## # A tibble: 6 x 3
##   data       review                                                 restaurant  
##   <date>     <chr>                                                  <chr>       
## 1 2021-03-08 "What a delightful and beautiful place in the heart o~ L'Atelier   
## 2 2021-01-17 "A good restaurant has the same quality of the food y~ Casa di Dav~
## 3 2021-01-07 "don't usually leave reviews here, but this place is ~ Casa di Dav~
## 4 2021-01-07 "Incredibly rude staff, how can you ask people to lea~ Casa di Dav~
## 5 2021-01-01 "First of all, it\x92s a good thing that it\x92s open~ L'Atelier   
## 6 2020-12-31 "I always love going back to Casa di David when visit~ Casa di Dav~

6. First and last review by each restaurant

We grouped the data by restaurant to check each restaurant’s first and last review. Casa di David has the earliest review overall, suggesting it was the first of the top 5 luxury restaurants to open; Kaiamo is the most recently opened restaurant in our dataset.

# Earliest/latest review by each restaurant
df %>%
  group_by(restaurant) %>%
  summarize(min_date = min(data),
            max_date = max(data)) %>%
  gt()
restaurant                              min_date     max_date
Casa di David                           2010-05-19   2021-01-17
Kaiamo                                  2018-09-14   2020-12-29
L'Atelier                               2014-07-27   2021-03-08
Relais & Chateaux Le Bistrot Francais   2014-05-24   2020-10-17
The Artist                              2017-08-25   2020-10-11

7. Checking each restaurant’s popularity by review count

The most popular restaurant among the top 5 is The Artist, which has the most reviews, followed by Relais & Chateaux and Casa di David. Kaiamo, the most recently opened, has the fewest reviews.

# Count of reviews per restaurant overall
df %>%
  count(restaurant) %>%
  ggplot(aes(reorder(restaurant, -n), n)) +
  geom_col(fill = "#FF6666") +  # geom_col() is geom_bar(stat = "identity")
  labs(title = "Count of reviews by restaurant",
       y = "# of reviews", x = "Restaurants") +
  theme(plot.title = element_text(hjust = 0.5))

8. Reviews before the pandemic outbreak

Most reviews were posted on the website before the pandemic outbreak.

# Count of reviews per restaurant before the pandemic
df %>%
  filter(data < "2020-01-01") %>%
  count(restaurant) %>%
  ggplot(aes(reorder(restaurant, -n), n)) +
  geom_col(fill = "#FF6666") +
  labs(title = "Count of reviews by restaurant before the pandemic",
       y = "# of reviews", x = "Restaurants") +
  theme(plot.title = element_text(hjust = 0.5))

9. Reviews during the pandemic outbreak

However, reviews were also posted during the pandemic outbreak, and we will analyze them further. Kaiamo gained popularity during the pandemic, while Relais & Chateaux lost ground.

# Count of reviews per restaurant during the pandemic
df %>%
  filter(data >= "2020-01-01") %>%
  count(restaurant) %>%
  ggplot(aes(reorder(restaurant, -n), n)) +
  geom_col(fill = "#FF6666") +
  labs(title = "Count of reviews by restaurant during the pandemic",
       y = "# of reviews", x = "Restaurants") +
  theme(plot.title = element_text(hjust = 0.5))

10. Let’s check the monthly reviews overall

We can see in the plot below that most of the reviews were written from 2018 to 2020, just before the pandemic outbreak.

# Number of reviews per month
df %>%
  count(Month = round_date(data, "month")) %>%
  ggplot(aes(Month, n)) +
  geom_line() + 
  ggtitle('The Number of Reviews Per Month')+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_date(date_breaks = "1 year",date_labels = "%Y")

11. Let’s check the monthly reviews by each restaurant

Casa di David, L’Atelier and Relais & Chateaux kept a fairly constant flow of reviews, while Kaiamo and The Artist saw a sharp drop in TripAdvisor reviews during the pandemic.

# Number of reviews per month per restaurant
df %>%
  count(Month = round_date(data, "month"),restaurant) %>%
  ggplot(aes(Month, n)) +
  geom_line() + 
  ggtitle('The Number of Reviews Per Month by Restaurant')+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_x_date(date_breaks = "3 year",date_labels = "%Y") +
  facet_wrap(~ restaurant)

Text Analysis

12. Assigning IDs and creating a month column

We assign a unique ID to each row of the data set for later use and create a month column. (Note that converting the Date column to POSIXct attaches a time of day in the session’s timezone, which is why the timestamps printed below show 03:00:00.)

# ID assign
df <- tibble::rowid_to_column(df, "ID")

# Mutate the month column
df <- df %>%
  mutate(data = as.POSIXct(data, origin = "1970-01-01"),month = round_date(data, "month"))

head(df)
## # A tibble: 6 x 5
##      ID data                review                restaurant month              
##   <int> <dttm>              <chr>                 <chr>      <dttm>             
## 1     1 2021-03-08 03:00:00 "What a delightful a~ L'Atelier  2021-03-01 00:00:00
## 2     2 2021-01-17 03:00:00 "A good restaurant h~ Casa di D~ 2021-02-01 00:00:00
## 3     3 2021-01-07 03:00:00 "don't usually leave~ Casa di D~ 2021-01-01 00:00:00
## 4     4 2021-01-07 03:00:00 "Incredibly rude sta~ Casa di D~ 2021-01-01 00:00:00
## 5     5 2021-01-01 03:00:00 "First of all, it\x9~ L'Atelier  2021-01-01 00:00:00
## 6     6 2020-12-31 03:00:00 "I always love going~ Casa di D~ 2021-01-01 00:00:00

13. Count each word

We define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. After using unnest_tokens, each row of the new data frame holds one token (word). We use the stop_words lexicon to remove uninformative words such as “is”, “I” and “and”. Note that the pipeline below also keeps each word at most once per review (distinct(ID, word)), so word_total counts the number of reviews that mention a word rather than its raw number of occurrences.

# Review words tokenization
review_words <- df %>%
  distinct(review, .keep_all = TRUE) %>%
  unnest_tokens(word, review, drop = FALSE) %>%
  distinct(ID, word, .keep_all = TRUE) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[^\\d]")) %>%
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup()

head(review_words[,c(1,3,4,6,7)])
## # A tibble: 6 x 5
##      ID review                                     restaurant word    word_total
##   <int> <chr>                                      <chr>      <chr>        <int>
## 1     1 What a delightful and beautiful place in ~ L'Atelier  deligh~         15
## 2     1 What a delightful and beautiful place in ~ L'Atelier  beauti~        108
## 3     1 What a delightful and beautiful place in ~ L'Atelier  heart           19
## 4     1 What a delightful and beautiful place in ~ L'Atelier  city            52
## 5     1 What a delightful and beautiful place in ~ L'Atelier  fell             2
## 6     1 What a delightful and beautiful place in ~ L'Atelier  love            33

14. Count and sort the words

We can also use dplyr’s count() to find the most common words in all reviews as a whole.

# Word counts sorted
word_counts <- review_words %>%
  count(word, sort = TRUE)

head(word_counts)
## # A tibble: 6 x 2
##   word           n
##   <chr>      <int>
## 1 food         733
## 2 service      490
## 3 restaurant   440
## 4 bucharest    324
## 5 menu         281
## 6 experience   277

15. The most common words found in reviews

These are the 25 most common words found in the reviews.

# Most common words
word_counts %>%
  head(25) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "lightblue") +
  scale_y_continuous(labels = comma_format()) +
  coord_flip() +
  labs(title = "Most common words in review text",
       y = "# of uses")+
  theme(plot.title = element_text(hjust = 0.5))

16. Term frequency in reviews

Let’s start by looking at the published reviews of the 5 restaurants and examine term frequency first, then tf-idf. We can start just by using dplyr verbs such as count(), group_by() and left_join(). What are the most commonly used words in restaurant reviews? (Let’s also calculate the total number of words per restaurant here, for later use.)

review_words_tf <- df %>%
  unnest_tokens(word, review) %>%
  count(restaurant,word,sort=TRUE)

total_review_words_tf <- review_words_tf %>%
  group_by(restaurant) %>%
  summarize(total = sum(n))

review_words_merge <- left_join(review_words_tf,total_review_words_tf)
## Joining, by = "restaurant"
head(review_words_merge)
## # A tibble: 6 x 4
##   restaurant                            word      n total
##   <chr>                                 <chr> <int> <int>
## 1 The Artist                            the    2052 29171
## 2 Relais & Chateaux Le Bistrot Francais the    1268 21471
## 3 Casa di David                         the    1069 16448
## 4 The Artist                            and    1060 29171
## 5 The Artist                            a       871 29171
## 6 Relais & Chateaux Le Bistrot Francais and     777 21471

17. Distribution of words

There is one row in this data frame for each word-restaurant combination; n is the number of times that word is used in that restaurant’s reviews and total is the total number of words in all of that restaurant’s reviews. The usual suspects have the highest n: “the”, “and”, “to”, and so forth. Let’s look at the distribution of n/total for each restaurant: the number of times a word appears divided by the total number of terms (words) in that restaurant’s reviews. This is exactly what term frequency is.
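
In symbols, the term frequency of word $t$ in restaurant $d$’s reviews is

$$\mathrm{tf}(t,d) = \frac{n_{t,d}}{\mathrm{total}_d},$$

where $n_{t,d}$ is the n column and $\mathrm{total}_d$ the total column of the data frame above.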

# Plot the distributions
ggplot(review_words_merge, aes(n/total, fill = restaurant)) +
  geom_histogram(show.legend = FALSE) +
  facet_wrap(~restaurant, ncol = 2, scales = "free_y")+
  ggtitle('Term distribution in restaurant reviews') +
  ylab("") + xlab("")+
  theme(plot.title = element_text(hjust = 0.5))

There are very long tails to the right for these restaurant reviews that we have not shown in these plots. These plots exhibit similar distributions for all the restaurant reviews, with many words that occur rarely and fewer words that occur frequently.

18. Zipf’s law

Distributions like those shown in the figure above are typical in language. In fact, those types of long-tailed distributions are so common in any given corpus of natural language (like a book, or a lot of text from a website, or spoken words) that the relationship between a word’s frequency and its rank has been the subject of study; a classic version of this relationship is called Zipf’s law, after George Zipf, a 20th-century American linguist. Zipf’s law states that the frequency with which a word appears is inversely proportional to its rank. Since we already have the data frame we used to plot term frequency, we can examine Zipf’s law for the restaurant reviews with just a few lines of dplyr functions.
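
In symbols, Zipf’s law says that the frequency $f$ of a word is inversely proportional to its rank $r$,

$$f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1,$$

so that on log-log axes $\log f = c - s \log r$: a straight line with slope $-s$. This is why the plots and the linear fit below use logarithmic scales.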

# Zipf law preprocessing
freq_by_rank <- review_words_merge %>% 
  group_by(restaurant) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total) %>%
  ungroup()

head(freq_by_rank)
## # A tibble: 6 x 6
##   restaurant                            word      n total  rank `term frequency`
##   <chr>                                 <chr> <int> <int> <int>            <dbl>
## 1 The Artist                            the    2052 29171     1           0.0703
## 2 Relais & Chateaux Le Bistrot Francais the    1268 21471     1           0.0591
## 3 Casa di David                         the    1069 16448     1           0.0650
## 4 The Artist                            and    1060 29171     2           0.0363
## 5 The Artist                            a       871 29171     3           0.0299
## 6 Relais & Chateaux Le Bistrot Francais and     777 21471     2           0.0362

19. First plot of Zipf’s law

The rank column here tells us the rank of each word within the frequency table; the table was already ordered by n so we could use row_number() to find the rank. Then, we can calculate the term frequency in the same way we did before. Zipf’s law is often visualized by plotting rank on the x-axis and term frequency on the y-axis, on logarithmic scales. Plotting this way, an inversely proportional relationship will have a constant, negative slope.

# Zipf law plot
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = restaurant)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()

20. Filtering the rank subset and finding the slope

Notice that the figure above is in log-log coordinates. We see that all five restaurants’ reviews are similar to each other, and that the relationship between rank and frequency does have a negative slope. It is not quite constant, though; perhaps we could view this as a broken power law with, say, three sections. Let’s see what the exponent of the power law is for the middle section of the rank range.

rank_subset <- freq_by_rank %>% 
  filter(rank < 500,
         rank > 10)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
## 
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
## 
## Coefficients:
## (Intercept)  log10(rank)  
##     -0.7794      -1.0217

21. Plot with the slope

We have in fact gotten a slope close to -1 here. Let’s plot this fitted power law with the data.

# Plot again adding the line
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = restaurant)) + 
  geom_abline(intercept = -0.78, slope = -1.02, 
              color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()+
  ggtitle("Zipf's Law") +
  theme(plot.title = element_text(hjust = 0.5))

We have found a result close to the classic version of Zipf’s law for the corpus of the five restaurants’ reviews. The deviations we see at high rank are not uncommon for many kinds of language; a corpus often contains fewer rare words than predicted by a single power law. The deviations at low rank are more unusual: reviewers use a lower percentage of the most common words than many collections of language do.

22. The bind_tf_idf() function

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents; in this case, the group of restaurant reviews as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. The bind_tf_idf() function in the tidytext package takes a tidy text data set as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (restaurants in this case), and the last necessary column contains the counts: how many times each document contains each term (n in this example). We calculated a total for each restaurant for our explorations in previous sections, but it is not necessary for the bind_tf_idf() function; the table only needs to contain all the words in each document.

# Compute TF-IDF 
review_tf_idf <- review_words_merge %>%
  bind_tf_idf(word, restaurant, n)

head(review_tf_idf)
## # A tibble: 6 x 7
##   restaurant                            word      n total     tf   idf tf_idf
##   <chr>                                 <chr> <int> <int>  <dbl> <dbl>  <dbl>
## 1 The Artist                            the    2052 29171 0.0703     0      0
## 2 Relais & Chateaux Le Bistrot Francais the    1268 21471 0.0591     0      0
## 3 Casa di David                         the    1069 16448 0.0650     0      0
## 4 The Artist                            and    1060 29171 0.0363     0      0
## 5 The Artist                            a       871 29171 0.0299     0      0
## 6 Relais & Chateaux Le Bistrot Francais and     777 21471 0.0362     0      0

Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in all five restaurants’ reviews, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.
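
Concretely, with $N = 5$ restaurants serving as documents,

$$\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad \text{tf-idf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t),$$

where $n_t$ is the number of restaurants whose reviews contain word $t$. A word appearing in all five restaurants gets $\ln(5/5) = 0$, while a word unique to one restaurant gets $\ln(5/1) \approx 1.61$, exactly the idf value we see for words like “lake” and “kaiamo” below.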

23. Tf-idf rearranged

Let’s look at terms with high tf-idf in the restaurant reviews.

# Tf-idf from highest to lowest
review_tf_idf %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 11,758 x 6
##    restaurant                            word          n      tf   idf  tf_idf
##    <chr>                                 <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 Casa di David                         lake        103 0.00626  1.61 0.0101 
##  2 The Artist                            spoon       126 0.00432  1.61 0.00695
##  3 Kaiamo                                kaiamo       32 0.00343  1.61 0.00552
##  4 Casa di David                         casa         42 0.00255  1.61 0.00411
##  5 Casa di David                         di           41 0.00249  1.61 0.00401
##  6 Casa di David                         david        40 0.00243  1.61 0.00391
##  7 L'Atelier                             epoque       17 0.00199  1.61 0.00321
##  8 Relais & Chateaux Le Bistrot Francais bistrot      40 0.00186  1.61 0.00300
##  9 Relais & Chateaux Le Bistrot Francais francais     37 0.00172  1.61 0.00277
## 10 Casa di David                         herastrau    26 0.00158  1.61 0.00254
## # ... with 11,748 more rows

24. Custom stop words

Let’s remove some of these less meaningful words to make a better, more meaningful plot. Notice that we make a custom list of stop words and use anti_join() to remove them from the tf-idf table; this is a flexible approach that can be used in many situations.

# Create custom stop words that will be removed from our data set
mystopwords <- tibble(word = c("11", "p", "le", "di", "casa", "david",
                               "l'atelier", "artist", "kaiamo", "bistro",
                               "brie", "bistrot", "francais"))

# Remove the stop words
review_tf_idf <- anti_join(review_tf_idf, mystopwords, 
                           by = "word")

25. Visualize the high tf-idf

Let’s look at a visualization of these high tf-idf words.

# Plot highest tf-idfs by restaurant
review_tf_idf %>%
  group_by(restaurant) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = restaurant)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~restaurant, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)

26. Tokenizing by n-grams

So far we have used the unnest_tokens function to tokenize by word, which is useful for the kinds of frequency and sentiment analyses we have been doing. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.

# Bigrams
review_bigrams <- df %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2)

head(review_bigrams)
## # A tibble: 6 x 5
##      ID data                restaurant month               bigram         
##   <int> <dttm>              <chr>      <dttm>              <chr>          
## 1     1 2021-03-08 03:00:00 L'Atelier  2021-03-01 00:00:00 what a         
## 2     1 2021-03-08 03:00:00 L'Atelier  2021-03-01 00:00:00 a delightful   
## 3     1 2021-03-08 03:00:00 L'Atelier  2021-03-01 00:00:00 delightful and 
## 4     1 2021-03-08 03:00:00 L'Atelier  2021-03-01 00:00:00 and beautiful  
## 5     1 2021-03-08 03:00:00 L'Atelier  2021-03-01 00:00:00 beautiful place
## 6     1 2021-03-08 03:00:00 L'Atelier  2021-03-01 00:00:00 place in

27. Counting and filtering n-grams

Our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count(). As one might expect, a lot of the most common bigrams are pairs of common (uninteresting) words, such as “of the” and “to be”: what we call stop words. This is a useful time to use tidyr’s separate(), which splits a column into multiple columns based on a delimiter. This lets us separate the bigram into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop word.

# Separate the words
bigrams_separated <- review_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Filter the separated bigrams
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# Count the bigrams
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

# Unite the bigrams
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

# 15 most common bigrams
bigrams_united %>%
  count(bigram, sort = TRUE)%>%
  top_n(15) %>%
  gt()
bigram n
tasting menu 80
spoon tasting 70
fine dining 66
wine list 62
casa di 40
di david 40
highly recommend 39
le bistrot 36
michelin star 35
bistrot francais 34
cucumber sorbet 32
foie gras 31
highly recommended 30
sea bass 30
excellent service 27

28. Visualizing a network of bigrams with ggraph

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes.

# Word pairs preprocess
review_subject <- df %>% 
  unnest_tokens(word, review) %>% 
  anti_join(stop_words)

my_stopwords <- tibble(word = as.character(1:10))
review_subject <- review_subject %>% 
  anti_join(my_stopwords)

# Pair wise count
title_word_pairs <- review_subject %>% 
  pairwise_count(word, ID, sort = TRUE, upper = FALSE)

# Word pairs plot
set.seed(1234)
title_word_pairs %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  ggtitle('Word network in TripAdvisor reviews')+
  theme(plot.title = element_text(hjust = 0.5))+
  theme_void()

29. Pairwise correlation

We want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately. In particular, here we’ll focus on the phi coefficient, a common measure for binary correlation. The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other.
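
In symbols, for words $x$ and $y$, let $n_{11}$ be the number of reviews containing both, $n_{10}$ and $n_{01}$ the numbers containing only one of them, and $n_{00}$ the number containing neither. The phi coefficient, which pairwise_cor() computes for every word pair, is

$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}},$$

where the dotted subscripts denote row and column totals, e.g. $n_{1\cdot} = n_{11} + n_{10}$.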

# Correlations preprocess
review_word_cors <- review_subject %>% 
  group_by(word) %>%
  filter(n() >= 4) %>%
  pairwise_cor(word,ID, sort = TRUE, upper = FALSE)

# Correlation plot
set.seed(1234)
review_word_cors %>%
  filter(correlation > .5) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "royalblue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  ggtitle('Correlation network in TripAdvisor reviews')+
  theme_void()+
  theme(plot.title = element_text(hjust = 0.5))

30. Trigrams

Let’s look in the same manner at trigrams instead of bigrams.

# Compute trigrams
review_trigrams <- df %>%
  unnest_tokens(trigram, review, token = "ngrams", n = 3)

# Separate the words
trigrams_separated <- review_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# Filter the words
trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word)

# Count the trigrams
trigram_counts <- trigrams_filtered %>% 
  count(word1, word2, word3, sort = TRUE)

# Unite the trigrams
trigrams_united <- trigrams_filtered %>%
  unite(trigram, word1, word2, word3, sep = " ")

# Top 10 trigrams (top_n() keeps ties, so a few more than 10 rows appear)
trigrams_united %>%
  count(trigram, sort = TRUE) %>%
  top_n(10) %>%
  gt()
trigram n
casa di david 40
le bistrot francais 32
spoon tasting menu 21
extensive wine list 8
chef paul oppenkamp 7
fine dining restaurant 7
fine dining experience 6
spoon tasting menus 6
michelin star restaurant 5
cucumber ice cream 4
fine dining restaurants 4
free wi fi 4
michelin star restaurants 4

31. Overall wordcloud

We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in a tidy format is useful for other plots as well. For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in the restaurant reviews as a whole again, but this time as a wordcloud.

# Wordcloud for all the period
review_words %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

32. Wordcloud grouped by restaurants

This time, the wordcloud highlights the most used words in the reviews of each restaurant.

# Cast the data from tidy format to corpus
corpus_review <- corpus(df, text_field = "review")

# Plot the grouped wordcloud
corpus_review%>%
  tokens(remove_punct = TRUE) %>%
  tokens(remove_symbols  = TRUE) %>%
  tokens(remove_numbers  = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_group(groups = restaurant) %>%
  dfm_trim(min_termfreq = 5, verbose = FALSE) %>%
  textplot_wordcloud(comparison = TRUE)

33. Comparison cloud

Let’s use sentiment analysis to tag positive and negative words via an inner join, then find the most common positive and negative words. Until the step where we need to send the data to comparison.cloud(), this can all be done with joins, piping, and dplyr because our data is in tidy format.

# Comparison wordcloud for all the period
review_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "red"),
                   max.words = 50)

34. Sentiment analysis for all the period

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
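
In symbols, this bag-of-words view scores a review as

$$\mathrm{sentiment}(\text{review}) = \sum_{w \,\in\, \text{review}} \mathrm{score}(w),$$

where $\mathrm{score}(w)$ comes from a sentiment lexicon (Bing labels words positive or negative; AFINN assigns integers from $-5$ to $+5$) and words absent from the lexicon contribute nothing. Later, in section 37, we use the mean rather than the sum so that long reviews do not dominate.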

# Remove missing values if any
reviews <- df %>% 
  filter(!is.na(review)) %>% 
  select(ID, review)

# Unnest tokens
tidy_reviews <- reviews %>%
  unnest_tokens(word, review)

# Remove stop words
tidy_reviews <- tidy_reviews %>%
  anti_join(stop_words)

# Get the sentiments using bing dictionary
bing_word_counts <- tidy_reviews %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Plot the most used negative and positive words in reviews
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip() + 
  ggtitle('Words that contribute to positive and negative sentiment in the reviews')+
  theme(plot.title = element_text(hjust = 0.5))

35. The words preceded by “not” that contribute the most to the sentiment score

We can also use bigrams to provide context in sentiment analysis: a positive word such as “good” should not count as positive when it is preceded by “not”.

# Using bigrams to provide context in sentiment analysis
bigrams_separated %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
## # A tibble: 260 x 3
##    word1 word2            n
##    <chr> <chr>        <int>
##  1 not   a               34
##  2 not   to              22
##  3 not   only            18
##  4 not   the             17
##  5 not   too             17
##  6 not   be              15
##  7 not   disappointed    13
##  8 not   so              13
##  9 not   very            13
## 10 not   cheap           11
## # ... with 250 more rows
# Get the AFINN dictionary
AFINN <- get_sentiments("afinn")

# Filter for word not
not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE) %>%
  ungroup()

# Plot the contribution 
not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment score * number of occurrences") +
  ggtitle('The 20 words preceded by "not" that had the greatest contribution to 
          sentiment scores, positive or negative direction') +
  coord_flip()

36. Other negations

We can extend this to other negation words such as “no”, “never” and “without”.

negation_words <- c("not", "no", "never", "without")

# Filter for the negation words
negated_words <- bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>%
  ungroup()

# Plot the most common words to follow negations
negated_words %>%
  mutate(contribution = n * value,
         word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  group_by(word1) %>%
  top_n(12, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words preceded by negation term") +
  ylab("Sentiment score * # of occurrences") +
  ggtitle('The most common positive or negative words to follow negations 
          such as "no", "not", "never" and "without"') +
  coord_flip()

37. Get the most negative and most positive review

Let’s find the most positive review and the most negative one. We score each review by the mean AFINN value of its words, keeping only reviews with at least five scored words to reduce noise.

# Sentiment messages
sentiment_messages <- tidy_reviews %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(ID) %>%
  summarize(sentiment = mean(value),
            words = n()) %>%
  ungroup() %>%
  filter(words >= 5)

# Arrange descending to get the most positive review
sentiment_messages %>%
  arrange(desc(sentiment))
## # A tibble: 360 x 3
##       ID sentiment words
##    <int>     <dbl> <int>
##  1   571      3.6      5
##  2  1029      3.5      6
##  3   201      3.4      5
##  4   333      3.4      5
##  5   401      3.4      5
##  6   904      3.4      5
##  7   688      3.33     6
##  8   946      3.33     9
##  9   954      3.33     6
## 10   529      3.27    11
## # ... with 350 more rows
# The most positive review
df[ which(df$ID==571), ]$review[1]
## [1] "This is a fantastic restaurant. Located in a beautiful building it has a class not easily matched in Bucharest. The chef is outstanding. Service is excellent and the food served is well balanced and of excellent quality. One of the little gems on the menu: the Foie Gras."
# Arrange ascending to get the most negative review
sentiment_messages %>%
  arrange(sentiment)
## # A tibble: 360 x 3
##       ID sentiment words
##    <int>     <dbl> <int>
##  1    40    -2.43      7
##  2   782    -2.29      7
##  3    52    -2.2       5
##  4   717    -2.14     21
##  5     9    -1.6       5
##  6   115    -1.6       5
##  7     4    -1.5       6
##  8    24    -1.5       6
##  9  1059    -1.08     12
## 10    21    -0.833     6
## # ... with 350 more rows
#The most negative review
df[ which(df$ID==40), ]$review[1]
## [1] "I have been here for valentine's day, we booked online with a \"vegetarian option\". When we came the first impression was not that bad. good atmosphere and a pianist playing, but something terrible has to happen here. When the waiter gave us the menu all the vegetarian option was edit with a choose with only one plate ( a salad as a starter!). I asked the waiter why and he replied \"just because today we have special menu for the day\" but in their website they didn't make any note about the reduction of the menu. We also let him notice that a vegetarian option 'll be coming in the moment of booking. The waiter point out about there was for me a starter and should I go for! also he added that during the night there was a vegetarian person and not any adjustment was has been done on the menu. So i told him to double check with the chef. At the end they find a solution to let me have 3 tortelloni with ricotta cheese. The starter as a salad was awful too much vinegar and lentils without salt and no black truffles flavour was find in the dish ( I attached a pic so you can see how bad was!) For all the night the bad manner of the waiter continued. I left all the salad in my plate and he didn't ask me if I enjoyed or not. When the bill came they didn't take off my salad. TERRIBLE SERVICE. we 'll never go again. PLUS the sommelier haven't gave us any info about the type of grape into the wine."

Before and during the pandemic outbreak

Let’s now repeat the analysis, splitting the reviews by date: reviews posted before 2020-01-01 count as pre-pandemic and reviews from 2020-01-01 onwards as pandemic-period. Comparing the two periods lets us check whether customer perception changed.

# Filter the data before the pandemic
reviews_before_pandemic <- df %>% 
  filter(data < "2020-01-01") %>% 
  select(ID, review)

# Unnest tokens
tidy_reviews_before_pandemic <- reviews_before_pandemic %>%
  unnest_tokens(word, review)

# Remove stop words
tidy_reviews_before_pandemic <- tidy_reviews_before_pandemic %>%
  anti_join(stop_words)

# Join with the sentiment dictionary 
bing_word_counts_before_pandemic <- tidy_reviews_before_pandemic %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Filter the data during the pandemic
reviews_after_pandemic <- df %>% 
  filter(data >= "2020-01-01") %>% 
  select(ID, review)

# Unnest tokens for pandemic
tidy_reviews_after_pandemic <- reviews_after_pandemic %>%
  unnest_tokens(word, review)

# Remove the stop words
tidy_reviews_after_pandemic <- tidy_reviews_after_pandemic %>%
  anti_join(stop_words)

# Join with the sentiment dictionary
bing_word_counts_after_pandemic <- tidy_reviews_after_pandemic %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Plot the words that contribute to positive and negative sentiment

bing_word_counts_before_pandemic %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip() + 
  ggtitle('Words that contribute to positive and negative sentiment in the reviews before the pandemic')+
  theme(plot.title = element_text(hjust = 0.5))

bing_word_counts_after_pandemic %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip() + 
  ggtitle('Words that contribute to positive and negative sentiment in the reviews during the pandemic')+
  theme(plot.title = element_text(hjust = 0.5))
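
The two ggplots above print one after another. To place them side by side, one option is to save each plot to an object and arrange them with the gridExtra package; a toy sketch with built-in data, assuming gridExtra is installed:

# Toy illustration: arrange two ggplot objects in one row
library(gridExtra)
p1 <- ggplot(mtcars, aes(factor(cyl)))  + geom_bar()
p2 <- ggplot(mtcars, aes(factor(gear))) + geom_bar()
grid.arrange(p1, p2, ncol = 2)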

# Bigrams before pandemic
review_bigrams_before_pandemic <- df %>%
  filter(data < "2020-01-01") %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2)

# Separate the words
bigrams_separated_before_pandemic <- review_bigrams_before_pandemic %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Filter for not word
bigrams_separated_before_pandemic %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
## # A tibble: 251 x 3
##    word1 word2            n
##    <chr> <chr>        <int>
##  1 not   a               28
##  2 not   to              20
##  3 not   only            17
##  4 not   too             17
##  5 not   be              15
##  6 not   the             14
##  7 not   disappointed    13
##  8 not   so              12
##  9 not   very            12
## 10 not   cheap           11
## # ... with 241 more rows
AFINN <- get_sentiments("afinn")

# Combine the results
not_words_before_pandemic <- bigrams_separated_before_pandemic %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE) %>%
  ungroup()

# Bigrams in pandemic
review_bigrams_after_pandemic <- df %>%
  filter(data >= "2020-01-01") %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2)

bigrams_separated_after_pandemic <- review_bigrams_after_pandemic %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_separated_after_pandemic %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)
## # A tibble: 25 x 3
##    word1 word2     n
##    <chr> <chr> <int>
##  1 not   a         6
##  2 not   the       3
##  3 not   all       2
##  4 not   at        2
##  5 not   easy      2
##  6 not   least     2
##  7 not   to        2
##  8 not   able      1
##  9 not   any       1
## 10 not   fancy     1
## # ... with 15 more rows
not_words_after_pandemic <- bigrams_separated_after_pandemic %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE) %>%
  ungroup()

# Plot words preceded by "not" that had the greatest contribution to sentiment scores

not_words_before_pandemic %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment score * number of occurrences") +
  ggtitle('The 20 words preceded by "not" that had the greatest contribution to 
          sentiment scores, positive or negative direction, before the pandemic') +
  coord_flip()+
  theme(plot.title = element_text(hjust = 0.5))

not_words_after_pandemic %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment score * number of occurrences") +
  ggtitle('The words preceded by "not" that had the greatest contribution to 
          sentiment scores, positive or negative direction, during the pandemic') +
  coord_flip()+
  theme(plot.title = element_text(hjust = 0.5))

# Negation words
negation_words <- c("not", "no", "never", "without")

# Filter for the new negation words
negated_words__before_pandemic <- bigrams_separated_before_pandemic %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>%
  ungroup()

negated_words__after_pandemic <- bigrams_separated_after_pandemic %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>%
  ungroup()

# Plot the most common positive or negative words to follow negations

negated_words__before_pandemic %>%
  mutate(contribution = n * value,
         word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  group_by(word1) %>%
  top_n(12, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words preceded by negation term") +
  ylab("Sentiment score * # of occurrences") +
  ggtitle('The most common positive or negative words to follow negations 
          such as "no", "not", "never" and "without" before the pandemic') +
  coord_flip()+
  theme(plot.title = element_text(hjust = 0.5))

negated_words__after_pandemic %>%
  mutate(contribution = n * value,
         word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  group_by(word1) %>%
  top_n(12, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words preceded by negation term") +
  ylab("Sentiment score * # of occurrences") +
  ggtitle('The most common positive or negative words to follow negations 
          such as "no", "not", "never" and "without" during the pandemic') +
  coord_flip()+
  theme(plot.title = element_text(hjust = 0.5))