This analysis compares the determinants of guest experience at the top 5 score luxury restaurants in Bucharest Romania before and during the pandemic-COVID-19.In particular, 1048 reviews were posted before the pandemic outbreak and 58 reviews were posted during the pandemic. The data was collected using web scrapping techniques on the TripAdvisor website for the following restaurants: Kaiamo, Casa di David, L’Atelier, The Artist and Relais & Chateaux Le Bistrot Francais.
For this analysis the tidyverse was used and some packages for text mining techniques.
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
library(tidytext)
library(tidyverse)
library(stringr)
library(tidyr)
library(scales)
library(broom)
library(purrr)
library(forcats)
library(widyr)
library(igraph)
library(ggraph)
library(SnowballC)
library(wordcloud)
library(reshape2)
library(textdata)
library(gt)
library(quanteda)
library(quanteda.textplots)
The reviews were scraped from TripAdvisor. The script was run up in April 2021 so new reviews might have been added in the meantime on the website.
df <- read_csv("Top5_luxury_restaurant_Bucharest.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## data = col_character(),
## review = col_character(),
## restaurant = col_character()
## )
The csv file contains 3 string columns: date, review and restaurant. The date column was transformed from string to date variable for further analysis.
df <- df[complete.cases(df), ]
df$data <- as.Date(df$data, format = "%B %d %Y")
The dimension of the data is 1106 observations x 3 columns. The earliest review is on 19th May 2010 and the latest one on 8th March 2021
# checking the dimension of the data
dim(df)
## [1] 1106 3
# earliest review
min(df$data)
## [1] "2010-05-19"
#latest review
max(df$data)
## [1] "2021-03-08"
We arranged the data from latest to oldest review in order to better have an overview of the data.
# Arrange the data from latest review to oldest
df <- df %>%
arrange(desc(data))
head(df)
## # A tibble: 6 x 3
## data review restaurant
## <date> <chr> <chr>
## 1 2021-03-08 "What a delightful and beautiful place in the heart o~ L'Atelier
## 2 2021-01-17 "A good restaurant has the same quality of the food y~ Casa di Dav~
## 3 2021-01-07 "don't usually leave reviews here, but this place is ~ Casa di Dav~
## 4 2021-01-07 "Incredibly rude staff, how can you ask people to lea~ Casa di Dav~
## 5 2021-01-01 "First of all, it\x92s a good thing that it\x92s open~ L'Atelier
## 6 2020-12-31 "I always love going back to Casa di David when visit~ Casa di Dav~
We grouped the data by each restaurant to check the first and the last review by each restaurant. We spot that Casa di David has the earliest review overall being the first luxury restaurant opened among the top 5. The most recent opened restaurant is Kaiamo in our dataset.
# Earliest/latest review by each restaurant
df %>%
group_by(restaurant) %>%
summarize(min_date = min(data),
max_date = max(data)) %>%
gt()
| restaurant | min_date | max_date |
|---|---|---|
| Casa di David | 2010-05-19 | 2021-01-17 |
| Kaiamo | 2018-09-14 | 2020-12-29 |
| L'Atelier | 2014-07-27 | 2021-03-08 |
| Relais & Chateaux Le Bistrot Francais | 2014-05-24 | 2020-10-17 |
| The Artist | 2017-08-25 | 2020-10-11 |
The most popular restaurant among top 5 is The Artist having the most reviews, followed by Relais & Chateaux and Casa di David. The most recent opened restaurant, Kaiamo has the lowest count of reviews.
# Count of review per restaurant overall
df %>%
group_by(restaurant) %>%
count(restaurant) %>%
arrange(n) %>%
ggplot(aes(reorder(restaurant,-n),n), fill = n)+
geom_bar(stat="identity",fill = "#FF6666") +
labs(title = "Count of reviews by restaurant",
y = "# of reviews", x = "Restaurants")+
theme(plot.title = element_text(hjust = 0.5))
Most reviews were written on the website before the pandemic outbreak.
# Count of review per restaurant before pandemic
df %>%
group_by(restaurant) %>%
filter(data < "2020-01-01") %>%
count(restaurant) %>%
arrange(n) %>%
ggplot(aes(reorder(restaurant,-n),n), fill = n)+
geom_bar(stat="identity",fill = "#FF6666") +
labs(title = "Count of reviews by restaurant before pandemic",
y = "# of reviews", x = "Restaurants")+
theme(plot.title = element_text(hjust = 0.5))
Howevevr, we can see that during the pandemic outbreak there were also reviews added on the website that we are going to analyze further. We can see that Kaiamo gained popularity in pandemic outbreak and Relais & Chateaux lost popularity.
# Count of review per restaurant in pandemic
df %>%
group_by(restaurant) %>%
filter(data >= "2020-01-01") %>%
count(restaurant) %>%
arrange(n) %>%
ggplot(aes(reorder(restaurant,-n),n), fill = n)+
geom_bar(stat="identity",fill = "#FF6666") +
labs(title = "Count of reviews by restaurant in pandemic",
y = "# of reviews", x = "Restaurants")+
theme(plot.title = element_text(hjust = 0.5))
We can see in the below plot that most of the reviews were written from 2018 to 2020, just before the pandemic outbreak.
# Number of reviews per month
df %>%
count(Month = round_date(data, "month")) %>%
ggplot(aes(Month, n)) +
geom_line() +
ggtitle('The Number of Reviews Per Month')+
theme(plot.title = element_text(hjust = 0.5))+
scale_x_date(date_breaks = "1 year",date_labels = "%Y")
Casa di David,L’Atelier and Relais & Chateaux have a constant number of reviews, while Kaiamo and The Artist suffered a lot in pandemic regarding given reviews on TripAdvisor
# Number of reviews per month per restaurant
df %>%
count(Month = round_date(data, "month"),restaurant) %>%
ggplot(aes(Month, n)) +
geom_line() +
ggtitle('The Number of Reviews Per Month by Restaurant')+
theme(plot.title = element_text(hjust = 0.5))+
scale_x_date(date_breaks = "3 year",date_labels = "%Y") +
facet_wrap(~ restaurant)
We will assign an unique id for each row in our data set for further purpose. We also create a month column.
# ID assign
df <- tibble::rowid_to_column(df, "ID")
# Mutate the month column
df <- df %>%
mutate(data = as.POSIXct(data, origin = "1970-01-01"),month = round_date(data, "month"))
head(df)
## # A tibble: 6 x 5
## ID data review restaurant month
## <int> <dttm> <chr> <chr> <dttm>
## 1 1 2021-03-08 03:00:00 "What a delightful a~ L'Atelier 2021-03-01 00:00:00
## 2 2 2021-01-17 03:00:00 "A good restaurant h~ Casa di D~ 2021-02-01 00:00:00
## 3 3 2021-01-07 03:00:00 "don't usually leave~ Casa di D~ 2021-01-01 00:00:00
## 4 4 2021-01-07 03:00:00 "Incredibly rude sta~ Casa di D~ 2021-01-01 00:00:00
## 5 5 2021-01-01 03:00:00 "First of all, it\x9~ L'Atelier 2021-01-01 00:00:00
## 6 6 2020-12-31 03:00:00 "I always love going~ Casa di D~ 2021-01-01 00:00:00
We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. After using unnest_tokens, we’ve split each row so that there is one token (word) in each row of the new data frame. We are using stop words to remove unnecesary words such as “is”,“I”, “and” etc.
# Review words tokenization
review_words <- df %>%
distinct(review, .keep_all = TRUE) %>%
unnest_tokens(word, review, drop = FALSE) %>%
distinct(ID, word, .keep_all = TRUE) %>%
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[^\\d]")) %>%
group_by(word) %>%
mutate(word_total = n()) %>%
ungroup()
head(review_words[,c(1,3,4,6,7)])
## # A tibble: 6 x 5
## ID review restaurant word word_total
## <int> <chr> <chr> <chr> <int>
## 1 1 What a delightful and beautiful place in ~ L'Atelier deligh~ 15
## 2 1 What a delightful and beautiful place in ~ L'Atelier beauti~ 108
## 3 1 What a delightful and beautiful place in ~ L'Atelier heart 19
## 4 1 What a delightful and beautiful place in ~ L'Atelier city 52
## 5 1 What a delightful and beautiful place in ~ L'Atelier fell 2
## 6 1 What a delightful and beautiful place in ~ L'Atelier love 33
We can also use dplyr’s count() to find the most common words in all reviews as a whole.
# Word counts sorted
word_counts <- review_words %>%
count(word, sort = TRUE)
head(word_counts)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 food 733
## 2 service 490
## 3 restaurant 440
## 4 bucharest 324
## 5 menu 281
## 6 experience 277
This is the top 25 words found in reviews.
# Most common words
word_counts %>%
head(25) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(fill = "lightblue") +
scale_y_continuous(labels = comma_format()) +
coord_flip() +
labs(title = "Most common words in review text",
y = "# of uses")+
theme(plot.title = element_text(hjust = 0.5))
Let’s start by looking at the published reviews of the 5 restaurants and examine first term frequency, then tf-idf. We can start just by using dplyr verbs such as group_by() and join(). What are the most commonly used words in restaurant reviews? (Let’s also calculate the total words in each restaurant here, for later use.)
review_words_tf <- df %>%
unnest_tokens(word, review) %>%
count(restaurant,word,sort=TRUE)
total_review_words_tf <- review_words_tf %>%
group_by(restaurant) %>%
summarize(total = sum(n))
review_words_merge <- left_join(review_words_tf,total_review_words_tf)
## Joining, by = "restaurant"
head(review_words_merge)
## # A tibble: 6 x 4
## restaurant word n total
## <chr> <chr> <int> <int>
## 1 The Artist the 2052 29171
## 2 Relais & Chateaux Le Bistrot Francais the 1268 21471
## 3 Casa di David the 1069 16448
## 4 The Artist and 1060 29171
## 5 The Artist a 871 29171
## 6 Relais & Chateaux Le Bistrot Francais and 777 21471
There is one row in this restaurant reviews data frame for each word-restaurant combination; n is the number of times that word is used in that restaurant and total is the total words in that review. The usual suspects are here with the highest n, “the”, “and”, “to”, and so forth. Let’s look at the distribution of n/total for each restaurant, the number of times a word appears in a restaurant divided by the total number of terms (words) in that restaurant This is exactly what term frequency is.
# Plot the distributions
ggplot(review_words_merge, aes(n/total, fill = restaurant)) +
geom_histogram(show.legend = FALSE) +
facet_wrap(~restaurant, ncol = 2, scales = "free_y")+
ggtitle('Terms distribution of restarant reviews') +
ylab("") + xlab("")+
theme(plot.title = element_text(hjust = 0.5))
There are very long tails to the right for these restaurant reviews that we have not shown in these plots. These plots exhibit similar distributions for all the restaurant reviews, with many words that occur rarely and fewer words that occur frequently.
Distributions like those shown in figure above are typical in language. In fact, those types of long-tailed distributions are so common in any given corpus of natural language (like a book, or a lot of text from a website, or spoken words) that the relationship between the frequency that a word is used and its rank has been the subject of study; a classic version of this relationship is called Zipf’s law, after George Zipf, a 20th century American linguist. Zipf’s law states that the frequency that a word appears is inversely proportional to its rank. Since we have the data frame we used to plot term frequency, we can examine Zipf’s law for restaurants reviews with just a few lines of dplyr functions.
# Zipf law preprocessing
freq_by_rank <- review_words_merge %>%
group_by(restaurant) %>%
mutate(rank = row_number(),
`term frequency` = n/total) %>%
ungroup()
head(freq_by_rank)
## # A tibble: 6 x 6
## restaurant word n total rank `term frequency`
## <chr> <chr> <int> <int> <int> <dbl>
## 1 The Artist the 2052 29171 1 0.0703
## 2 Relais & Chateaux Le Bistrot Francais the 1268 21471 1 0.0591
## 3 Casa di David the 1069 16448 1 0.0650
## 4 The Artist and 1060 29171 2 0.0363
## 5 The Artist a 871 29171 3 0.0299
## 6 Relais & Chateaux Le Bistrot Francais and 777 21471 2 0.0362
The rank column here tells us the rank of each word within the frequency table; the table was already ordered by n so we could use row_number() to find the rank. Then, we can calculate the term frequency in the same way we did before. Zipf’s law is often visualized by plotting rank on the x-axis and term frequency on the y-axis, on logarithmic scales. Plotting this way, an inversely proportional relationship will have a constant, negative slope.
# Zipf law plot
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = restaurant)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
Notice that figure above is in log-log coordinates. We see that all five restaurants reviews are similar to each other, and that the relationship between rank and frequency does have negative slope. It is not quite constant, though; perhaps we could view this as a broken power law with, say, three sections. Let’s see what the exponent of the power law is for the middle section of the rank range.
rank_subset <- freq_by_rank %>%
filter(rank < 500,
rank > 10)
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -0.7794 -1.0217
We have in fact gotten a slope close to -1 here. Let’s plot this fitted power law with the data.
# Plot again adding the line
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = restaurant)) +
geom_abline(intercept = -0.78, slope = -1.02,
color = "gray50", linetype = 2) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()+
ggtitle("Zipf's Law") +
theme(plot.title = element_text(hjust = 0.5))
We have found a result close to the classic version of Zipf’s law for the corpus of the five restaurants reviews. The deviations we see here at high rank are not uncommon for many kinds of language; a corpus of language often contains fewer rare words than predicted by a single power law. The deviations at low rank are more unusual. Reviewers use a lower percentage of the most common words than many collections of language.
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of restaurants reviews as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. The bind_tf_idf() function in the tidytext package takes a tidy text data set as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (reviews in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example). We calculated a total for each restaurant for our explorations in previous sections, but it is not necessary for the bind_tf_idf() function; the table only needs to contain all the words in each document.
# Compute TF-IDF
review_tf_idf <- review_words_merge %>%
bind_tf_idf(word, restaurant, n)
head(review_tf_idf)
## # A tibble: 6 x 7
## restaurant word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 The Artist the 2052 29171 0.0703 0 0
## 2 Relais & Chateaux Le Bistrot Francais the 1268 21471 0.0591 0 0
## 3 Casa di David the 1069 16448 0.0650 0 0
## 4 The Artist and 1060 29171 0.0363 0 0
## 5 The Artist a 871 29171 0.0299 0 0
## 6 Relais & Chateaux Le Bistrot Francais and 777 21471 0.0362 0 0
Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in all five restaurants reviews, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.
Let’s look at terms with high tf-idf in restaurants reviews.
# Tf-idf from highest to lowest
review_tf_idf %>%
select(-total) %>%
arrange(desc(tf_idf))
## # A tibble: 11,758 x 6
## restaurant word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Casa di David lake 103 0.00626 1.61 0.0101
## 2 The Artist spoon 126 0.00432 1.61 0.00695
## 3 Kaiamo kaiamo 32 0.00343 1.61 0.00552
## 4 Casa di David casa 42 0.00255 1.61 0.00411
## 5 Casa di David di 41 0.00249 1.61 0.00401
## 6 Casa di David david 40 0.00243 1.61 0.00391
## 7 L'Atelier epoque 17 0.00199 1.61 0.00321
## 8 Relais & Chateaux Le Bistrot Francais bistrot 40 0.00186 1.61 0.00300
## 9 Relais & Chateaux Le Bistrot Francais francais 37 0.00172 1.61 0.00277
## 10 Casa di David herastrau 26 0.00158 1.61 0.00254
## # ... with 11,748 more rows
Let’s remove some of these less meaningful words to make a better, more meaningful plot. Notice that we make a custom list of stop words and use anti_join() to remove them; this is a flexible approach that can be used in many situations. We will need to go back a few steps since we are removing words from the tidy data frame.
# Create custom stop words that will be removed from our data set
mystopwords <- tibble(word = c("11","p","le","di","casa","david","l'atelier","artist","kaiamo","le","bistro","brie","bistrot","francais"))
# Remove the stop words
review_tf_idf <- anti_join(review_tf_idf, mystopwords,
by = "word")
Let’s look at a visualization for these high tf-idf words
# Plot highest tf-idfs by restaurant
review_tf_idf %>%
group_by(restaurant) %>%
slice_max(tf_idf, n = 10) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = restaurant)) +
geom_col(show.legend = FALSE) +
facet_wrap(~restaurant, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL)
We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
# Bigrams
review_bigrams <- df %>%
unnest_tokens(bigram, review, token = "ngrams", n = 2)
head(review_bigrams)
## # A tibble: 6 x 5
## ID data restaurant month bigram
## <int> <dttm> <chr> <dttm> <chr>
## 1 1 2021-03-08 03:00:00 L'Atelier 2021-03-01 00:00:00 what a
## 2 1 2021-03-08 03:00:00 L'Atelier 2021-03-01 00:00:00 a delightful
## 3 1 2021-03-08 03:00:00 L'Atelier 2021-03-01 00:00:00 delightful and
## 4 1 2021-03-08 03:00:00 L'Atelier 2021-03-01 00:00:00 and beautiful
## 5 1 2021-03-08 03:00:00 L'Atelier 2021-03-01 00:00:00 beautiful place
## 6 1 2021-03-08 03:00:00 L'Atelier 2021-03-01 00:00:00 place in
Our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count(). As one might expect, a lot of the most common bigrams are pairs of common (uninteresting) words, such as of the and to be: what we call “stop-words”. This is a useful time to use tidyr’s separate(), which splits a column into multiple based on a delimiter. This lets us separate it into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop-word.
# Separate the words
bigrams_separated <- review_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
# Filter the separated bigrams
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# Count the bigrams
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
# Unite the bigrams
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
# 15 most common bigrams
bigrams_united %>%
count(bigram, sort = TRUE)%>%
top_n(15) %>%
gt()
| bigram | n |
|---|---|
| tasting menu | 80 |
| spoon tasting | 70 |
| fine dining | 66 |
| wine list | 62 |
| casa di | 40 |
| di david | 40 |
| highly recommend | 39 |
| le bistrot | 36 |
| michelin star | 35 |
| bistrot francais | 34 |
| cucumber sorbet | 32 |
| foie gras | 31 |
| highly recommended | 30 |
| sea bass | 30 |
| excellent service | 27 |
We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes.
# Word pairs preprocess
review_subject <- df %>%
unnest_tokens(word, review) %>%
anti_join(stop_words)
my_stopwords <- data_frame(word = c(as.character(1:10)))
review_subject <- review_subject %>%
anti_join(my_stopwords)
# Pair wise count
title_word_pairs <- review_subject %>%
pairwise_count(word, ID, sort = TRUE, upper = FALSE)
# Word pairs plot
set.seed(1234)
title_word_pairs %>%
filter(n >= 50) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
ggtitle('Word network in TripAdvisor reviews')+
theme(plot.title = element_text(hjust = 0.5))+
theme_void()
We want to examine correlation among words, which indicates how often they appear together relative to how often they appear separately.In particular, here we’ll focus on the phi coefficient, a common measure for binary correlation. The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other.
# Correlations preprocess
review_word_cors <- review_subject %>%
group_by(word) %>%
filter(n() >= 4) %>%
pairwise_cor(word,ID, sort = TRUE, upper = FALSE)
# Correlation plot
set.seed(1234)
review_word_cors %>%
filter(correlation > .5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "royalblue") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
ggtitle('Correlation network in TripAdvisor reviews')+
theme_void()+
theme(plot.title = element_text(hjust = 0.5))
Let’s look in the same manner at trigrams instead of bigrams.
# Compute trigrams
review_trigrams <- df %>%
unnest_tokens(trigram, review, token = "ngrams", n = 3)
# Separate the words
trigrams_separated <- review_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
# Filter the words
trigrams_filtered <- trigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word)
# Count the trigrams
trigram_counts <- trigrams_filtered %>%
count(word1, word2, word3, sort = TRUE)
# Unite the trigrams
trigrams_united <- trigrams_filtered %>%
unite(trigram, word1, word2, word3, sep = " ")
# Top 10 trigrams
trigrams_united %>%
count(trigram, sort = TRUE) %>%
top_n(10) %>%
gt()
| trigram | n |
|---|---|
| casa di david | 40 |
| le bistrot francais | 32 |
| spoon tasting menu | 21 |
| extensive wine list | 8 |
| chef paul oppenkamp | 7 |
| fine dining restaurant | 7 |
| fine dining experience | 6 |
| spoon tasting menus | 6 |
| michelin star restaurant | 5 |
| cucumber ice cream | 4 |
| fine dining restaurants | 4 |
| free wi fi | 4 |
| michelin star restaurants | 4 |
We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in a tidy format is useful for other plots as well. For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in restaurants reviews as a whole again, but this time as a wordcloud.
# Wordcloud for all the period
review_words %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
This time, the wordcloud highlights the most used words of reviews of each restaurant.
# Cast the data from tidy format to corpus
corpus_review <- corpus(df, text_field = "review")
# Plot the grouped wordcloud
corpus_review%>%
tokens(remove_punct = TRUE) %>%
tokens(remove_symbols = TRUE) %>%
tokens(remove_numbers = TRUE) %>%
tokens_remove(stopwords("english")) %>%
dfm() %>%
dfm_group(groups = restaurant) %>%
dfm_trim(min_termfreq = 5, verbose = FALSE) %>%
textplot_wordcloud(comparison = TRUE)
Let’s do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Until the step where we need to send the data to comparison.cloud(), this can all be done with joins, piping, and dplyr because our data is in tidy format.
# Comparison wordcloud for all the period
review_words %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("blue", "red"),
max.words = 50)
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
# Remove missing values if any
reviews <- df %>%
filter(!is.na(review)) %>%
select(ID, review) %>%
group_by(row_number()) %>%
ungroup()
# Unnest tokens
tidy_reviews <- reviews %>%
unnest_tokens(word, review)
# Remove stop words
tidy_reviews <- tidy_reviews %>%
anti_join(stop_words)
# Get the sentiments using bing dictionary
bing_word_counts <- tidy_reviews %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
# Plot the most used negative and positive words in reviews
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip() +
ggtitle('Words that contribute to positive and negative sentiment in the reviews')+
theme(plot.title = element_text(hjust = 0.5))
We can also use bigrams to identify the sentiment scores.
# Using bigrams to provide context in sentiment analysis
bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
## # A tibble: 260 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 not a 34
## 2 not to 22
## 3 not only 18
## 4 not the 17
## 5 not too 17
## 6 not be 15
## 7 not disappointed 13
## 8 not so 13
## 9 not very 13
## 10 not cheap 11
## # ... with 250 more rows
# Get the AFINN dictionary
AFINN <- get_sentiments("afinn")
# Filter for word not
not_words <- bigrams_separated %>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE) %>%
ungroup()
# Plot the contribution
not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"not\"") +
ylab("Sentiment score * number of occurrences") +
ggtitle('The 20 words preceded by "not" that had the greatest contribution to
sentiment scores, positive or negative direction') +
coord_flip()
We can identify other type of negation words such as no, never, without.
negation_words <- c("not", "no", "never", "without")
# Filter for the negation words
negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE) %>%
ungroup()
# Plot the most common words to follow negations
negated_words %>%
mutate(contribution = n * value,
word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
group_by(word1) %>%
top_n(12, abs(contribution)) %>%
ggplot(aes(word2, contribution, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ word1, scales = "free") +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
xlab("Words preceded by negation term") +
ylab("Sentiment score * # of occurrences") +
ggtitle('The most common positive or negative words to follow negations
such as "no", "not", "never" and "without"') +
coord_flip()
Let’s try to find out the most positive review and the most negative one
# Sentiment messages
sentiment_messages <- tidy_reviews %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(ID) %>%
summarize(sentiment = mean(value),
words = n()) %>%
ungroup() %>%
filter(words >= 5)
# Arrange descending to get the most positive review
sentiment_messages %>%
arrange(desc(sentiment))
## # A tibble: 360 x 3
## ID sentiment words
## <int> <dbl> <int>
## 1 571 3.6 5
## 2 1029 3.5 6
## 3 201 3.4 5
## 4 333 3.4 5
## 5 401 3.4 5
## 6 904 3.4 5
## 7 688 3.33 6
## 8 946 3.33 9
## 9 954 3.33 6
## 10 529 3.27 11
## # ... with 350 more rows
# The most positive review
df[ which(df$ID==571), ]$review[1]
## [1] "This is a fantastic restaurant. Located in a beautiful building it has a class not easily matched in Bucharest. The chef is outstanding. Service is excellent and the food served is well balanced and of excellent quality. One of the little gems on the menu: the Foie Gras."
# Arrange ascending to get the most negative review
sentiment_messages %>%
arrange(sentiment)
## # A tibble: 360 x 3
## ID sentiment words
## <int> <dbl> <int>
## 1 40 -2.43 7
## 2 782 -2.29 7
## 3 52 -2.2 5
## 4 717 -2.14 21
## 5 9 -1.6 5
## 6 115 -1.6 5
## 7 4 -1.5 6
## 8 24 -1.5 6
## 9 1059 -1.08 12
## 10 21 -0.833 6
## # ... with 350 more rows
#The most negative review
df[ which(df$ID==40), ]$review[1]
## [1] "I have been here for valentine's day, we booked online with a \"vegetarian option\". When we came the first impression was not that bad. good atmosphere and a pianist playing, but something terrible has to happen here. When the waiter gave us the menu all the vegetarian option was edit with a choose with only one plate ( a salad as a starter!). I asked the waiter why and he replied \"just because today we have special menu for the day\" but in their website they didn't make any note about the reduction of the menu. We also let him notice that a vegetarian option 'll be coming in the moment of booking. The waiter point out about there was for me a starter and should I go for! also he added that during the night there was a vegetarian person and not any adjustment was has been done on the menu. So i told him to double check with the chef. At the end they find a solution to let me have 3 tortelloni with ricotta cheese. The starter as a salad was awful too much vinegar and lentils without salt and no black truffles flavour was find in the dish ( I attached a pic so you can see how bad was!) For all the night the bad manner of the waiter continued. I left all the salad in my plate and he didn't ask me if I enjoyed or not. When the bill came they didn't take off my salad. TERRIBLE SERVICE. we 'll never go again. PLUS the sommelier haven't gave us any info about the type of grape into the wine."
Let’s try now to do the same analysis as for overall reviews, but splitting by date. We will use data till 2019 as before pandemic and after 2020 as in pandemic outbreak for comparison purpose to check if the customer perception changed.
## Filter the data before pandemic
reviews_before_pandemic <- df %>%
filter(data < "2020-01-01") %>%
select(ID, review) %>%
group_by(row_number()) %>%
ungroup()
# Unnest tokens
tidy_reviews_before_pandemic <- reviews_before_pandemic %>%
unnest_tokens(word, review)
# Remove stop words
tidy_reviews_before_pandemic <- tidy_reviews_before_pandemic %>%
anti_join(stop_words)
# Join with the sentiment dictionary
bing_word_counts_before_pandemic <- tidy_reviews_before_pandemic %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
# Filter the data in pandemic
reviews_after_pandemic <- df %>%
filter(data >= "2020-01-01") %>%
select(ID, review) %>%
group_by(row_number()) %>%
ungroup()
# Unnest tokens for pandemic
tidy_reviews_after_pandemic <- reviews_after_pandemic %>%
unnest_tokens(word, review)
# Remove the stop words
tidy_reviews_after_pandemic <- tidy_reviews_after_pandemic %>%
anti_join(stop_words)
# Join with the sentiment dictionary
bing_word_counts_after_pandemic <- tidy_reviews_after_pandemic %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
# Plot the words that contribute to positive and negative sentiment
par(mfrow=c(1,2))
bing_word_counts_before_pandemic %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip() +
ggtitle('Words that contribute to positive and negative sentiment in the reviews before pandemic')+
theme(plot.title = element_text(hjust = 0.5))
bing_word_counts_after_pandemic %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip() +
ggtitle('Words that contribute to positive and negative sentiment in the reviews in pandemic')+
theme(plot.title = element_text(hjust = 0.5))
# Bigrams before pandemic
review_bigrams_before_pandemic <- df %>%
filter(data < "2020-01-01") %>%
unnest_tokens(bigram, review, token = "ngrams", n = 2)
# Separate the words
bigrams_separated_before_pandemic <- review_bigrams_before_pandemic %>%
separate(bigram, c("word1", "word2"), sep = " ")
# Filter for not word
bigrams_separated_before_pandemic %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
## # A tibble: 251 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 not a 28
## 2 not to 20
## 3 not only 17
## 4 not too 17
## 5 not be 15
## 6 not the 14
## 7 not disappointed 13
## 8 not so 12
## 9 not very 12
## 10 not cheap 11
## # ... with 241 more rows
AFINN <- get_sentiments("afinn")
# Combine the results
not_words_before_pandemic <- bigrams_separated_before_pandemic %>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE) %>%
ungroup()
# Bigrams in pandemic
review_bigrams_after_pandemic <- df %>%
filter(data > "2020-01-01") %>%
unnest_tokens(bigram, review, token = "ngrams", n = 2)
bigrams_separated_after_pandemic <- review_bigrams_after_pandemic %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_separated_after_pandemic %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
## # A tibble: 25 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 not a 6
## 2 not the 3
## 3 not all 2
## 4 not at 2
## 5 not easy 2
## 6 not least 2
## 7 not to 2
## 8 not able 1
## 9 not any 1
## 10 not fancy 1
## # ... with 15 more rows
not_words_after_pandemic <- bigrams_separated_after_pandemic %>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE) %>%
ungroup()
# Plot words preceded by "not" that had the greatest contribution to sentiment scores
par(mfrow=c(1,2))
not_words_before_pandemic %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"not\"") +
ylab("Sentiment score * number of occurrences") +
ggtitle('The 20 words preceded by "not" that had the greatest contribution to
sentiment scores, positive or negative direction before pandemic') +
coord_flip()+
theme(plot.title = element_text(hjust = 0.5))
not_words_after_pandemic %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"not\"") +
ylab("Sentiment score * number of occurrences") +
ggtitle('The 3 words preceded by "not" that had the greatest contribution to
sentiment scores, positive or negative direction in pandemic') +
coord_flip()+
theme(plot.title = element_text(hjust = 0.5))
# Negation words
negation_words <- c("not", "no", "never", "without")
# Filter for the new negation words
negated_words__before_pandemic <- bigrams_separated_before_pandemic %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE) %>%
ungroup()
negated_words__after_pandemic <- bigrams_separated_after_pandemic %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE) %>%
ungroup()
# Plot the most common positive or negative words to follow negations
par(mfrow=c(1,2))
negated_words__before_pandemic %>%
mutate(contribution = n * value,
word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
group_by(word1) %>%
top_n(12, abs(contribution)) %>%
ggplot(aes(word2, contribution, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ word1, scales = "free") +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
xlab("Words preceded by negation term") +
ylab("Sentiment score * # of occurrences") +
ggtitle('The most common positive or negative words to follow negations
such as "no", "not", "never" and "without" before pandemic') +
coord_flip()+
theme(plot.title = element_text(hjust = 0.5))
negated_words__after_pandemic %>%
mutate(contribution = n * value,
word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
group_by(word1) %>%
top_n(12, abs(contribution)) %>%
ggplot(aes(word2, contribution, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ word1, scales = "free") +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
xlab("Words preceded by negation term") +
ylab("Sentiment score * # of occurrences") +
ggtitle('The most common positive or negative words to follow negations
such as "no", "not", "never" and "without" in pandemic') +
coord_flip()+
theme(plot.title = element_text(hjust = 0.5))