Executive summary

This report analyzes the 515k Hotel Reviews Data in Europe dataset. Hotel reviews were chosen because booking a hotel is something nearly every traveler does, and a hotel’s scores and reviews on the booking website largely shape the traveler’s choice. Emotionally charged reviews tend to prompt more emotional choices. To better understand the sentiments reviewers express when they write reviews and give scores, and whether those sentiments stem entirely from the reviewers’ personal factors, this dataset of about 515,000 European hotel reviews was selected for analysis.

Among the 227 reviewer nationalities recorded in this dataset, reviewers from the United Kingdom were chosen because they contribute the largest number of reviews. In addition, the British are popularly associated with a ‘gentleman temperament’, so their attitude towards hotel reviews is of some research interest. Choosing United Kingdom reviewers therefore has two advantages: first, the sample is large enough to support a focused analysis; second, it allows the report to explore whether the reviewing habits of United Kingdom reviewers reflect this temperament.

This report asks the following two research questions and answers them through data analysis:

What are the most frequent words and phrases in positive and negative reviews? And which words are most likely to appear together?
Is the sentiment of reviews correlated with reviewer scores?

By answering these two questions, this report examines how often words and phrases occur in the reviews and which words are highly correlated. It also examines the relationship between review sentiment and the scores given by United Kingdom reviewers. The findings relate back to the practical impact of hotel reviews and scores on hotel selection, and provide a quantitative basis for further research on how negative reviews affect hotel selection.

Data background

The data was scraped from Booking.com, which originally owns it. The dataset contains about 515,000 customer reviews and scores for 1,493 luxury hotels across Europe, described by 17 variables. In this report, ‘Reviewer_Nationality’ is used first to select the reviewers of interest. According to the two research questions, the following variables are then selected:

In the research on positive/negative term frequency, two variables, ‘Negative_Review’ and ‘Positive_Review’, are used. They represent the negative review text and the positive review text respectively.
In the research on the correlation between review sentiment and scores, three variables, ‘Negative_Review’, ‘Positive_Review’ and ‘Reviewer_Score’, are used. They represent the negative review text, the positive review text and the score given by the reviewer, respectively.

Data loading, cleaning and preprocessing

1 Data loading and cleaning

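First, the data required for this research is loaded, and the select function is used to extract the required variables. This keeps the rest of the analysis clearer.

# load the packages used in this report (inferred from the functions called below)
library(tidyverse)   # read_csv, dplyr, tidyr, ggplot2
library(tidytext)    # unnest_tokens, stop_words, bind_tf_idf
library(widyr)       # pairwise_cor
library(wordcloud)   # wordcloud
library(igraph)      # graph_from_data_frame
library(ggraph)      # ggraph, geom_edge_link, geom_node_point, geom_node_text
library(tidygraph)   # as_tbl_graph, centrality_degree, group_infomap

# load the raw data
raw_review <- read_csv("data/Hotel_Reviews.csv")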

## Rows: 515738 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Hotel_Address, Review_Date, Hotel_Name, Reviewer_Nationality, Negat...
## dbl (9): Additional_Number_of_Scoring, Average_Score, Review_Total_Negative_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Put the original data into the new object 'review', and use the select function to extract the required variables
review <- raw_review %>% select(Reviewer_Nationality,               # reviewer nationality
                                Negative_Review,                    # negative review text
                                Positive_Review,                    # positive review text
                                Reviewer_Score)                     # reviewer score

Counting the reviews by reviewer nationality with the count function shows that reviewers from the United Kingdom wrote the most reviews, 245,246 in total. The sample size is therefore sufficient for a sentiment analysis of United Kingdom reviewers. Next, the filter function is used to keep only the reviews written by United Kingdom reviewers.

# Sort the number of reviews by the nationality of the reviewers using the count function
review %>%
    count(Reviewer_Nationality, sort = T)

## # A tibble: 227 × 2
##    Reviewer_Nationality          n
##    <chr>                     <int>
##  1 United Kingdom           245246
##  2 United States of America  35437
##  3 Australia                 21686
##  4 Ireland                   14827
##  5 United Arab Emirates      10235
##  6 Saudi Arabia               8951
##  7 Netherlands                8772
##  8 Switzerland                8678
##  9 Germany                    7941
## 10 Canada                     7894
## # ℹ 217 more rows

# Use the filter function to filter reviewers whose nationality is the United Kingdom
review <- review %>%
    filter(Reviewer_Nationality == "United Kingdom")

2 Data preprocessing

Depending on the research question, the variables will be pre-processed.

2-1 Data preprocessing of research on positive/negative term and ngram frequency

According to the data source website, when a reviewer gives no negative or positive review, the variables ‘Negative_Review’ and ‘Positive_Review’ contain the placeholder values ‘No Negative’ and ‘No Positive’ respectively. Reviews with either placeholder are therefore filtered out before the analysis.

# Filter values
review <- review %>%
    filter(!Negative_Review == "No Negative") %>%
    filter(!Positive_Review == "No Positive")

The positive and negative reviews are tokenized into words, stop words are removed, the occurrences of each word are counted, and the results are combined into an object ‘review_words’. This step makes it possible to observe how often each word occurs.

# Tokenize the word in negative reviews
negative_words <- review %>%
    select(Negative_Review) %>%
    unnest_tokens(word, Negative_Review) %>%
    anti_join(stop_words) %>%
    count(word, sort = T) %>%
    mutate(label = c("negative"))

## Joining with `by = join_by(word)`

# Tokenize the word in positive reviews
positive_words <- review %>%
    select(Positive_Review) %>%
    unnest_tokens(word, Positive_Review) %>%
    anti_join(stop_words) %>%
    count(word, sort = T) %>%
    mutate(label = c("positive"))

## Joining with `by = join_by(word)`

# Combine positive/negative review words
review_words <- negative_words %>%
    bind_rows(positive_words) %>%
    arrange(desc(n))
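
The tokenizing and counting steps above can be tried on a small example. The following is a minimal sketch on two made-up review texts (the object names toy and toy_words are hypothetical and not part of the report). It uses the same unnest_tokens, anti_join and count calls, and passes by = "word" explicitly, which also avoids the "Joining with" message.

```r
library(dplyr)
library(tidytext)

# Two made-up positive reviews (hypothetical data, for illustration only)
toy <- tibble(
  code = 1:2,
  Positive_Review = c("The staff were friendly",
                      "Friendly staff and a great location")
)

toy_words <- toy %>%
  unnest_tokens(word, Positive_Review) %>%   # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%     # drop stop words such as "the"
  count(word, sort = TRUE)                   # occurrences of each remaining word

toy_words
```

Because unnest_tokens converts the text to lower case, ‘Friendly’ and ‘friendly’ are counted as the same word.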

The paired words of positive reviews and negative reviews are tokenized using the ‘ngram’ method and assembled into an object ‘review_bigram’. This step is to observe which words occur frequently in pairs.

# Tokenize pairwise phrases in negative reviews
negative_bigram <- review %>% 
    select(Negative_Review) %>%
    unnest_tokens(bigram, Negative_Review, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram)) %>%
    mutate(label = c("negative"))

# Tokenize pairwise phrases in positive reviews
positive_bigram <- review %>% 
    select(Positive_Review) %>%
    unnest_tokens(bigram, Positive_Review, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram)) %>%
    mutate(label = c("positive"))

# Combine bigrams in negative/positive reviews
review_bigram <- negative_bigram %>%
    bind_rows(positive_bigram)

Each review is numbered, and the words of the positive and negative reviews are tokenized again with the review number kept. This step prepares the data for the phi coefficient analysis below.

# Number each review
review <- review %>%
    mutate(code = row_number())   # 170,329 reviews remain after filtering

# Tokenize the word in negative reviews
negative_phi <- review %>%
    select(Negative_Review, code) %>%
    unnest_tokens(word, Negative_Review, drop = F) %>%
    anti_join(stop_words) %>%
    mutate(label = c("negative")) %>%
    rename(review = Negative_Review)

## Joining with `by = join_by(word)`

# Tokenize the word in positive reviews
positive_phi <- review %>%
    select(Positive_Review, code) %>%
    unnest_tokens(word, Positive_Review, drop = F) %>%
    anti_join(stop_words) %>%
    mutate(label = c("positive")) %>%
    rename(review = Positive_Review)

## Joining with `by = join_by(word)`

2-2 Data preprocessing of research on correlation between review sentiments and reviewer scores

In the research of the relationship between review sentiment scores and reviewer scores, it is necessary to tokenize the words in positive/negative reviews to support the calculation of sentiment scores for each review.

# Tokenize the word in negative reviews
negative_word_tidy <- review %>%
    select(Negative_Review, code, Reviewer_Score) %>%
    unnest_tokens(word, Negative_Review) %>%
    anti_join(stop_words) %>%
    mutate(label = c("negative"))

## Joining with `by = join_by(word)`

# Tokenize the word in positive reviews
positive_word_tidy <- review %>%
    select(Positive_Review, code, Reviewer_Score) %>%
    unnest_tokens(word, Positive_Review) %>%
    anti_join(stop_words) %>%
    mutate(label = c("positive"))

## Joining with `by = join_by(word)`

After completing the above loading, cleaning and pre-processing of the original data, the data becomes more concise and targeted. Next comes the actual data analysis.

Text data analysis

1-1 Word Frequency Analysis for Positive/Negative Reviews

Use the filter function and head function to list the top 10 words in negative reviews and positive reviews. In the negative reviews, ‘hotel’ appears most often (28,004 times), closely followed by ‘breakfast’ (27,133 times). The top 10 also include ‘staff’, ‘bed’, ‘bathroom’ and ‘shower’, all of which relate to hotel facilities and services. This suggests that United Kingdom reviewers often point to shortcomings in hotel facilities and services when they write negative reviews.

review_words %>%
    filter(label == "negative") %>%
    head(10)

## # A tibble: 10 × 3
##    word          n label   
##    <chr>     <int> <chr>   
##  1 hotel     28004 negative
##  2 breakfast 27133 negative
##  3 staff     17255 negative
##  4 didn      14085 negative
##  5 bed       13445 negative
##  6 bit       12189 negative
##  7 bar       11918 negative
##  8 bathroom  11520 negative
##  9 night     11209 negative
## 10 shower     9564 negative

In the positive reviews, ‘staff’ and ‘location’ occur most often, with 70,105 and 61,643 occurrences respectively. The top 10 words include many adjectives, such as ‘friendly’, ‘helpful’, ‘clean’, ‘excellent’ and ‘comfortable’. This shows that United Kingdom reviewers frequently use favourable adjectives when writing positive reviews.

review_words %>%
    filter(label == "positive")

## # A tibble: 23,834 × 3
##    word            n label   
##    <chr>       <int> <chr>   
##  1 staff       70105 positive
##  2 location    61643 positive
##  3 hotel       39777 positive
##  4 friendly    31031 positive
##  5 breakfast   27735 positive
##  6 helpful     27146 positive
##  7 clean       24408 positive
##  8 excellent   23279 positive
##  9 comfortable 23024 positive
## 10 bed         22812 positive
## # ℹ 23,824 more rows

Next, a log odds ratio is calculated for the tokenized words. Unlike a raw frequency count, the log odds ratio down-weights words that are likely to appear in both positive and negative reviews, so it can be used to find the words that are characteristic of each type of review. In this report, the higher the log ratio, the more strongly a word is associated with negative reviews; the lower the log ratio, the more strongly it is associated with positive reviews. According to the following data, the six words with the highest log ratio are ‘mouldy’, ‘torn’, ‘teeth’, ‘unreliable’, ‘flushed’ and ‘flushing’, which means they appear almost exclusively in negative reviews. The six words with the lowest log ratio are ‘theatres’, ‘aerobus’, ‘spotlessly’, ‘lication’, ‘lush’ and ‘freindly’ (‘lication’ and ‘freindly’ are presumably misspellings of ‘location’ and ‘friendly’), which means they appear almost exclusively in positive reviews.

word_ratios <- review_words %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%   # Keep words with at least 10 occurrences in total
  ungroup() %>%
  pivot_wider(names_from = label, values_from = n, values_fill = 0) %>%
  mutate(across(where(is.numeric), ~ (.x + 1) / (sum(.x) + 1))) %>%   # add-one smoothed share per label
  mutate(logratio = log(negative / positive)) %>%
  arrange(desc(logratio))

word_ratios %>% head()

## # A tibble: 6 × 4
##   word          positive  negative logratio
##   <chr>            <dbl>     <dbl>    <dbl>
## 1 mouldy     0.000000785 0.0000808     4.63
## 2 torn       0.000000785 0.0000656     4.43
## 3 teeth      0.000000785 0.0000519     4.19
## 4 unreliable 0.000000785 0.0000512     4.18
## 5 flushed    0.000000785 0.0000476     4.11
## 6 flushing   0.000000785 0.0000476     4.11

word_ratios %>% tail()

## # A tibble: 6 × 4
##   word        positive    negative logratio
##   <chr>          <dbl>       <dbl>    <dbl>
## 1 freindly   0.0000824 0.00000216     -3.64
## 2 lush       0.0000557 0.00000144     -3.65
## 3 lication   0.0000290 0.000000721    -3.70
## 4 spotlessly 0.000477  0.0000101      -3.86
## 5 aerobus    0.0000353 0.000000721    -3.89
## 6 theatres   0.000321  0.00000649     -3.90
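
The arithmetic behind these values can be followed on a small example. The sketch below uses made-up counts for three words (the data frame counts and the helper smooth_share are hypothetical) and applies the same formula as the code above: each count becomes (n + 1) / (sum(n) + 1), and the log ratio is the natural logarithm of the negative share divided by the positive share.

```r
# Made-up word counts per label (hypothetical data, for illustration only)
counts <- data.frame(
  word     = c("mouldy", "staff", "spotlessly"),
  negative = c(40, 500, 1),
  positive = c(0, 600, 60)
)

# Add-one smoothing: a word with zero occurrences still gets a small share,
# so the ratio below never divides by zero
smooth_share <- function(n) (n + 1) / (sum(n) + 1)

counts$logratio <- log(smooth_share(counts$negative) /
                       smooth_share(counts$positive))

counts[order(-counts$logratio), ]
```

A word used about equally often in both kinds of review, like ‘staff’ in this example, ends up close to 0 however frequent it is. In the output above, the six top words all share the same positive value, which is consistent with the smoothed share of a word that never occurs in positive reviews.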

1-2 Paired phrase Frequency Analysis for Positive/Negative Reviews

Use the count function for a first look at the bigram counts across positive and negative reviews. The most frequent bigrams consist almost entirely of stop words that carry little meaning on their own.

review_bigram %>%
    count(bigram, sort = T) %>%
    head()

## # A tibble: 6 × 2
##   bigram        n
##   <chr>     <int>
## 1 in the    40153
## 2 the room  38002
## 3 the hotel 29365
## 4 room was  29241
## 5 and the   25289
## 6 of the    24489

Therefore, each bigram is separated into its two words, and the filter function removes bigrams in which word1 or word2 is a stop word. Filtering on both words ensures that no stop word remains in either position. After counting again, the meaningless phrases have disappeared from the most frequent bigrams, and what remains is mostly combinations of adjectives and nouns. The most frequent of these phrases are mostly positive.

bigrams_separated <- review_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigrams_separated %>% 
    count(word1, word2, sort = TRUE) %>%
    head()

## # A tibble: 6 × 3
##   word1       word2        n
##   <chr>       <chr>    <int>
## 1 friendly    staff     8343
## 2 helpful     staff     5920
## 3 comfy       bed       3835
## 4 excellent   location  3766
## 5 comfortable bed       3604
## 6 reception   staff     3306
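
The bigram steps can be checked on a single made-up sentence. The following minimal sketch (the objects toy and toy_bigrams are hypothetical) uses the same unnest_tokens, separate and filter calls as above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# One made-up positive review (hypothetical data, for illustration only)
toy <- tibble(Positive_Review = "Friendly staff, comfy bed and excellent location")

toy_bigrams <- toy %>%
  unnest_tokens(bigram, Positive_Review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,    # drop bigrams whose first word is a stop word
         !word2 %in% stop_words$word)    # drop bigrams whose second word is a stop word

toy_bigrams
```

The tokenizer removes punctuation before pairing words, so a bigram such as ‘staff comfy’ can span a comma. This may explain bigrams that do not read like natural phrases in the later tables.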

Then the words are recombined into bigrams, and tf-idf is used to calculate the frequency and the weighted frequency of each bigram. Here the positive reviews and the negative reviews are treated as two documents. The reason for using tf-idf is to find bigrams that occur frequently in one type of review but are not common to both, which highlights the distinctive phrases used by United Kingdom reviewers. According to the following data, the bigrams with the highest tf-idf values all come from the positive reviews. It can be tentatively surmised that United Kingdom reviewers prefer positive comments to negative ones.

bigrams_united <- bigrams_separated %>%
  unite(bigram, word1, word2, sep = " ")

bigram_tf_idf <- bigrams_united %>%
  count(label, bigram) %>%
  bind_tf_idf(bigram, label, n) %>%
  arrange(desc(tf_idf))
bigram_tf_idf %>% head()

## # A tibble: 6 × 6
##   label    bigram                   n       tf   idf   tf_idf
##   <chr>    <chr>                <int>    <dbl> <dbl>    <dbl>
## 1 positive location easy          335 0.000743 0.693 0.000515
## 2 positive location comfortable   303 0.000672 0.693 0.000466
## 3 positive easy check             248 0.000550 0.693 0.000381
## 4 positive bed lovely             216 0.000479 0.693 0.000332
## 5 positive excellent friendly     188 0.000417 0.693 0.000289
## 6 positive bed clean              187 0.000415 0.693 0.000288
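
In the table above, every row has an idf of 0.693, which is the natural logarithm of 2. The following sketch on made-up counts (the objects toy_counts and toy_tf_idf are hypothetical) shows why: with only two documents, ‘positive’ and ‘negative’, bind_tf_idf gives a bigram that occurs under one label an idf of ln(2 / 1), and a bigram that occurs under both labels an idf of ln(2 / 2) = 0.

```r
library(dplyr)
library(tidytext)

# Made-up bigram counts per label (hypothetical data, for illustration only)
toy_counts <- tibble(
  label  = c("positive", "positive", "negative", "negative"),
  bigram = c("friendly staff", "easy check", "friendly staff", "slow service"),
  n      = c(8, 2, 1, 3)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(bigram, label, n) %>%   # term = bigram, document = label
  arrange(desc(tf_idf))

toy_tf_idf
```

With two documents, tf-idf is therefore zero for every bigram that appears under both labels, however frequent it is, and the ranking reflects term frequency among the bigrams that appear under one label only.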

In addition to the bigram analysis, this report also analyzes phi coefficients, which measure how strongly the presence of one word in a review is associated with the presence of another word in the same review. Phi coefficients are used instead of raw co-occurrence counts because they identify closely related words better: very common words co-occur often simply because they are common. In a network graph, phi coefficients link highly related words, so the word clusters are clearly visible and the structure is simple.

Because the sample is very large, only words with at least 2,000 occurrences are included in the phi coefficient analysis. As the output below shows, the highest word correlation in positive reviews (0.830) is higher than in negative reviews (0.654). This suggests that words in positive reviews are more closely connected, and that British reviewers tend to use set phrases in positive reviews.

review_cors_negative <- negative_phi %>%
  add_count(word) %>%
  filter(n >= 2000) %>%
  pairwise_cor(item = word, 
               feature = code, 
               sort = T)
review_cors_negative %>% head()

## # A tibble: 6 × 3
##   item1        item2        correlation
##   <chr>        <chr>              <dbl>
## 1 conditioning air                0.654
## 2 air          conditioning       0.654
## 3 con          air                0.623
## 4 air          con                0.623
## 5 tea          coffee             0.566
## 6 coffee       tea                0.566

review_cors_positive <- positive_phi %>%
  add_count(word) %>%
  filter(n >= 2000) %>%
  pairwise_cor(item = word, 
               feature = code, 
               sort = T)
review_cors_positive %>% head()

## # A tibble: 6 × 3
##   item1    item2    correlation
##   <chr>    <chr>          <dbl>
## 1 distance walking        0.830
## 2 walking  distance       0.830
## 3 coffee   tea            0.532
## 4 tea      coffee         0.532
## 5 staff    friendly       0.504
## 6 friendly staff          0.504
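
To make the phi coefficient concrete, the sketch below computes it by hand for two words across eight made-up reviews (the vectors has_walking and has_distance are hypothetical). For 0/1 presence indicators the phi coefficient equals the Pearson correlation, which is what pairwise_cor reports for each pair of words.

```r
# Presence (1) or absence (0) of each word in eight made-up reviews
has_walking  <- c(1, 1, 1, 0, 0, 0, 1, 0)
has_distance <- c(1, 1, 1, 0, 0, 0, 0, 0)

# Cells of the 2 x 2 contingency table
n11 <- sum(has_walking == 1 & has_distance == 1)   # both words present
n10 <- sum(has_walking == 1 & has_distance == 0)   # only 'walking'
n01 <- sum(has_walking == 0 & has_distance == 1)   # only 'distance'
n00 <- sum(has_walking == 0 & has_distance == 0)   # neither word

# Phi coefficient from the table cells and its row and column totals
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Pearson correlation of the same indicators, for comparison
pearson <- cor(has_walking, has_distance)

c(phi = phi, pearson = pearson)
```

The two values agree, which is why a high correlation in the tables above means that two words tend to appear in the same reviews and to be absent from the same reviews.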

2 Correlation research of review sentiment scores and hotel reviewer scores

To study the correlation between review sentiment and reviewer scores, the tokenized words of the negative and positive reviews are first combined, and the occurrences of each word are counted per review and per label. The pivot_wider function then places the positive and negative counts side by side, and the sentiment value of a word is its positive count minus its negative count; words that occur in only one of the two review texts produce NA and are dropped. Summing these values per review gives the sentiment score of the review. Finally, the left_join function attaches the reviewer score to each sentiment score, so that the two can be compared.

From the following data, reviews with the highest sentiment scores mostly come with high reviewer scores (7.9 or above), while reviews with the lowest sentiment scores mostly come with low reviewer scores (5 or below). However, the pattern does not hold throughout: two reviews with a sentiment score of -10 have a reviewer score of 9.2. It can therefore be speculated that reviewers with high sentiment scores are likely to give higher reviewer scores. There is some correlation between sentiment scores and reviewer scores, but whether the relationship is strictly proportional needs further research after excluding extreme values and outliers.

sentiment <- negative_word_tidy %>% 
    bind_rows(positive_word_tidy) %>%
    count(word, label, code) %>%
    pivot_wider(names_from = label, values_from = n) %>% 
    mutate(sentiment = positive - negative) %>%
    filter(!is.na(sentiment))

review_score <- review %>%
    select(Reviewer_Score, code)
    
sentiment_sum <- sentiment %>%
    group_by(code) %>%
    summarise(sentiment_sum = sum(sentiment)) %>%
    filter(!sentiment_sum == 0) %>%
    left_join(review_score, by = "code")

sentiment_sum %>%
    arrange(desc(sentiment_sum), desc(Reviewer_Score)) %>%
    head()

## # A tibble: 6 × 3
##     code sentiment_sum Reviewer_Score
##    <int>         <int>          <dbl>
## 1  95856            19            9.6
## 2 140357            15           10  
## 3 127193            12           10  
## 4  68347            11            7.9
## 5  35040            10            9.6
## 6 112550             9            9.6

sentiment_sum %>%
    arrange(desc(sentiment_sum), desc(Reviewer_Score)) %>%
    tail()

## # A tibble: 6 × 3
##     code sentiment_sum Reviewer_Score
##    <int>         <int>          <dbl>
## 1  23353            -8            3.8
## 2 147465            -8            3.3
## 3 143187            -9            5  
## 4  39176           -10            9.2
## 5 134993           -10            9.2
## 6 121588           -12            3.3
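
The head and tail of the table only show the extremes. One possible way to quantify the relationship over all reviews, which this report does not carry out, is a rank correlation test. The sketch below uses Spearman's rank correlation from base R on made-up values (the data frame toy is hypothetical); Spearman is chosen here as an assumption because it is less sensitive to extreme values than the Pearson correlation.

```r
# Made-up sentiment scores and reviewer scores (hypothetical data)
toy <- data.frame(
  sentiment_sum  = c(-12, -8, -3, -1, 2, 4, 7, 10, 15),
  Reviewer_Score = c(3.3, 3.8, 6.7, 9.2, 7.1, 7.5, 8.8, 9.6, 10)
)

# Spearman's rank correlation: do higher sentiment scores go with higher reviewer scores?
res <- cor.test(toy$sentiment_sum, toy$Reviewer_Score, method = "spearman")

res$estimate   # rank correlation coefficient (rho)
res$p.value    # p-value for the null hypothesis of no association
```

Applied to the report's data, the same call would take sentiment_sum$sentiment_sum and sentiment_sum$Reviewer_Score as its two arguments. With many tied values, R computes an approximate rather than an exact p-value.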

Individual analysis and figures

Analysis and Figure 1-1, 1-2

Word clouds are used here because they show the most frequent words in positive/negative reviews most intuitively: the larger a word, the more often it occurs. Placing the word clouds first lays the foundation for the other word frequency analyses.

# 1-1 Positive Review WordCloud
review_words %>%
    filter(label == "positive") %>%
    with(wordcloud(word, n, max.words = 100))
title(main = "1-1 Positive Review WordCloud map")   # called separately so the text is set as the main title

As can be observed in Figure 1-1, the words that most often appear in positive reviews are nouns such as ‘staff’, ‘location’, ‘hotel’, etc.

# 1-2 Negative Review WordCloud
review_words %>%
    filter(label == "negative") %>%
    with(wordcloud(word, n, max.words = 100))
title(main = "1-2 Negative Review WordCloud map")   # called separately so the text is set as the main title

It can be observed in Figure 1-2 that the words that most often appear in negative reviews are nouns such as ‘breakfast’, ‘staff’, ‘hotel’, etc.

The word clouds alone do not distinguish the words United Kingdom reviewers use in positive reviews from those they use in negative reviews, because words such as ‘staff’ and ‘hotel’ are frequent in both. Other methods of word frequency analysis are therefore needed.

Analysis and Figure 2

Figure 2 uses a bar graph of log odds ratios to identify the words that are characteristic of positive or negative reviews. The main difference from a plain word frequency analysis is that the log odds ratio discounts high-frequency words that appear in both types of review, which makes the analysis more specific. The bar graph shows the top 10 words for positive and negative reviews.

# 2 Bar graph with log odds ratio
word_ratios %>%
  group_by(logratio < 0) %>%
  slice_max(abs(logratio), n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ggtitle("Top 10 Distinctive Words in Negative/Positive Reviews with log odds ratio") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred")) +
  ylab("log odds ratio (Negative Review Words / Positive Review Words)")

According to Figure 2, the words that appear most often in negative reviews and rarely in positive reviews are ‘mouldy’, ‘torn’, ‘teeth’, ‘unreliable’, ‘humming’, ‘flushing’, ‘flushed’, ‘powdered’, ‘patchy’ and ‘leaked’. The words that appear most often in positive reviews and rarely in negative reviews are ‘theatres’, ‘aerobus’, ‘spotlessly’, ‘lication’, ‘lush’, ‘freindly’, ‘museums’, ‘wonderfully’, ‘chocs’ and ‘divine’. It can be speculated that United Kingdom reviewers use more adjectives in negative reviews to describe shortcomings in hotel facilities, whereas in positive reviews they pay more attention to the hotel’s surroundings.

Analysis and Figure 3

Figure 3 uses a bar graph of tf-idf to present the most distinctive bigrams in positive and negative reviews. Showing the bigrams of positive and negative reviews side by side makes the difference between the two clearer and the data easier to read. The previous analyses show that single-word frequencies are not enough and can easily give misleading results. The tf-idf method is therefore applied to bigrams: it shows which bigrams are frequent in positive or negative reviews, and it singles out bigrams that are frequent in one type of review without being common to both, which makes the sentiment analysis more accurate.

# 3 Bigram bar graph with tf-idf
bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(label) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(tf_idf, bigram, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ label, ncol = 2, scales = "free") +
  ggtitle("Top 10 Bigrams in Negative/Positive Reviews with tf-idf") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred")) +
  labs(x = "tf-idf of bigram", y = NULL)

From the chart above, the top 10 bigrams in negative reviews suggest that United Kingdom reviewers are often dissatisfied with slow service. From the top 10 bigrams in positive reviews, it can be inferred that United Kingdom reviewers do not have high requirements for hotel location, and that they like to use adverbs of degree in positive reviews, which in itself shows their satisfaction with the hotel.

Analysis and Figure 4-1, 4-2

Figures 4-1 and 4-2 show the connections between the words of each bigram as a network map, because this shows comprehensively and intuitively which words are most often used together. The arrows in the figures point from the first word of a bigram to the second. The darker the arrow, the more often that pair of words occurs in that order.

# filter for only relatively common combinations
negative_bigram_counts <- bigrams_separated %>% 
  filter(label == "negative") %>% 
  count(word1, word2, sort = TRUE)

negative_bigram_graph <- negative_bigram_counts %>%
  filter(n > 500) %>%
  graph_from_data_frame()

# 4-1 Bigram network graph from negative reviews
set.seed(0720)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(negative_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size = 4) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  ggtitle("Network Graph of Common Bigrams in Negative Reviews") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred"))

## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Figure 4-1 is a network map of the bigrams that often appear in negative reviews. According to the chart, the most frequent bigrams are ‘air con’ and ‘air conditioning’. Words such as ‘bar’, ‘coffee’ and ‘star’ have the most connections with other words.

# filter for only relatively common combinations
positive_bigram_counts <- bigrams_separated %>% 
  filter(label == "positive") %>% 
  count(word1, word2, sort = TRUE)

positive_bigram_graph <- positive_bigram_counts %>%
  filter(n > 500) %>%
  graph_from_data_frame()

# 4-2 Bigram network graph from positive reviews
set.seed(2007)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(positive_bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size = 4) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  ggtitle("Network Graph of Common Bigrams in Positive Reviews") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred"))

Figure 4-2 is a network map of the bigrams that often appear in positive reviews. According to the figure, the most frequent words are ‘friendly’ and ‘lovely’, and both are connected to many other words. It is therefore speculated that United Kingdom reviewers like to use these two words to describe their feelings about many different things.

The two network maps differ greatly in density: the positive map contains many more words than the negative one. To keep the two maps comparable, only bigrams with more than 500 occurrences were included in both. Many more bigrams exceed 500 occurrences in positive reviews than in negative reviews, which suggests that United Kingdom reviewers tend to use habitual phrases in positive reviews, while their negative reviews are more specific.
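
How the threshold shapes a network map can be seen on a small example. In the sketch below, the first two counts are taken from the bigram table earlier in this report and the last two are made up (the objects toy_counts and toy_graph are hypothetical). Only bigrams above the threshold become edges, and only words that belong to a remaining bigram become nodes.

```r
library(igraph)

# Bigram counts: the first two rows come from the table earlier in this report,
# the last two rows are made up for illustration
toy_counts <- data.frame(
  word1 = c("friendly", "helpful", "air", "slow"),
  word2 = c("staff", "staff", "conditioning", "service"),
  n     = c(8343, 5920, 900, 120)
)

# Same threshold as in the report: keep bigrams with more than 500 occurrences
toy_graph <- graph_from_data_frame(toy_counts[toy_counts$n > 500, ])

vcount(toy_graph)   # number of words (nodes) left in the graph
ecount(toy_graph)   # number of bigrams (edges) left in the graph
E(toy_graph)$n      # counts kept as an edge attribute, used for edge_alpha
```

Because the same threshold is applied to both labels, the density of each map depends directly on how many bigrams of that label pass it.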

Analysis and Figure 5-1, 5-2

Figures 5-1 and 5-2 use network maps of phi coefficients. Such a map shows not only which words are closely related but also, through the size and color of the dots and the thickness of the connecting lines, how many other words each word is connected to, which community it belongs to, and how strong each connection is. The phi coefficient analysis was carried out separately for negative and positive reviews.

# 5-1 Network graph with Phi Coefficients in Negative Reviews
set.seed(1001)
graph_cors_negative <- review_cors_negative %>%
  filter(correlation >= 0.15) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),
         group = as.factor(group_infomap()))

ggraph(graph_cors_negative, layout = "fr") +

  geom_edge_link(color = "gray50",
                 aes(edge_alpha = correlation, 
                     edge_width = correlation),
                 show.legend = F) +             
  scale_edge_width(range = c(1, 4)) +           

  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = F) +
  scale_size(range = c(5, 10)) +

  geom_node_text(aes(label = name),
                 repel = T,
                 size = 5) +

  theme_graph() +
  ggtitle("Network Graph with Phi Coefficients in Negative Reviews") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred"))

Figure 5-1 shows the phi coefficient analysis of the negative reviews. In the map, the most frequently occurring words fall into 20 communities, distinguished by color. The most strongly correlated pair is ‘air’ and ‘con/conditioning’, and the most common groups are ‘shower-water-bath-bathroom’ and ‘comfortable-beds-single’. This suggests that United Kingdom reviewers have definite expectations of room facilities and tend to write negative reviews about topics such as afternoon tea, hotel star rating, and hotel services.

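# 5-2 Network graph with Phi Coefficients in Positive Reviews
# Assumes dplyr, tidygraph, and ggraph are already loaded.
set.seed(1002)  # fix the random seed so the layout is reproducible
graph_cors_positive <- review_cors_positive %>%
  filter(correlation >= 0.15) %>%      # keep word pairs with phi >= 0.15
  as_tbl_graph(directed = FALSE) %>%
  mutate(centrality = centrality_degree(),      # number of connections per word
         group = as.factor(group_infomap()))    # community of each word

# Plot the positive-review graph (the original call passed graph_cors_negative)
ggraph(graph_cors_positive, layout = "fr") +
  # Edges: stronger correlations are drawn thicker and more opaque
  geom_edge_link(color = "gray50",
                 aes(edge_alpha = correlation,
                     edge_width = correlation),
                 show.legend = FALSE) +
  scale_edge_width(range = c(1, 4)) +
  # Nodes: size shows degree centrality, color shows community
  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = FALSE) +
  scale_size(range = c(5, 10)) +
  # Labels: repel to avoid overlapping text
  geom_node_text(aes(label = name),
                 repel = TRUE,
                 size = 5) +
  theme_graph() +
  ggtitle("Network Graph with Phi Coefficients in Positive Reviews") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred"))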

Figure 5-2 shows the phi coefficient analysis of the positive reviews. In the map, the most frequently occurring words fall into 20 communities, distinguished by color. The most strongly correlated pair is ‘air’ and ‘con/conditioning’, and the most common groups are ‘booked-beds-single’ and ‘coffee-tea-facilities’. Compared with the phi coefficient analysis of the negative reviews above, there is no big difference. In other words, United Kingdom reviewers use these habitual phrases in both positive and negative reviews, and neither set stands out as distinctive.

Analysis and Figure 6

Finally, Figure 6 uses a point graph (scatter plot) to study the correlation between each reviewer’s sentiment score and the score that reviewer gave the hotel. A point graph shows the distribution of individual points clearly, which makes it easier for readers to see how high and low sentiment scores are spread across different hotel scores.

# 6 Correlation point graph of Reviewer scores and Sentiment scores
sentiment_sum %>%
    ggplot(aes(x = Reviewer_Score, y = sentiment_sum)) + 
    geom_point() +
    ggtitle("Correlation Point Graph of Reviewer Scores and Sentiment Scores") +
    theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15, color = "darkred"))

According to the figure, reviewers with extremely high sentiment scores are concentrated almost entirely in the high range of hotel scores, while reviewers with low sentiment scores are not concentrated in the low range of hotel scores. This shows that a reviewer’s sentiment score is related to the hotel score they give, but the relationship is not strictly proportional. The higher a reviewer’s sentiment score, the more likely it is to be reflected in the score they give the hotel.
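The visual impression from the point graph could be complemented with a rank correlation test. The sketch below is only an illustration under stated assumptions: the data frame `scores` and its eight rows are invented toy values, not results from the dataset, and only the column names `Reviewer_Score` and `sentiment_sum` follow the plotting code. Spearman's method is chosen here because the relationship is described as monotonic but not proportional.

```r
# Hypothetical toy data; the values are invented for illustration only.
scores <- data.frame(
  Reviewer_Score = c(3.5, 5.0, 6.5, 7.5, 8.0, 9.0, 9.6, 10.0),
  sentiment_sum  = c(-2, 1, -1, 2, 0, 3, 5, 4)
)

# Spearman rank correlation: tests for a monotonic relationship
# without assuming that it is linear.
result <- cor.test(scores$Reviewer_Score, scores$sentiment_sum,
                   method = "spearman")

print(result$estimate)  # rho, between -1 and 1
print(result$p.value)   # p-value for the null hypothesis rho = 0
```

On the real data, a rho that is positive but well below 1 would be consistent with the reading above: sentiment and score move together, but not in a fixed proportion.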

Summary

To sum up, the results for the two research questions raised at the beginning of this report are as follows:

United Kingdom reviewers often use degree modifiers when reviewing hotels, and the word they use most often is ‘lovely’, which is in line with the popular impression of the United Kingdom gentleman. They pay attention to the hotel’s surroundings and location, but do not have high requirements for either.
United Kingdom reviewers pay more attention to hotel facilities and service attitude when making negative reviews. They seldom use habitual phrases in negative reviews and instead make more specific evaluations.
United Kingdom reviewers may be influenced by sentiment when scoring, but the degree of influence is not very large.

Word Frequency Research of Sentiment Reviews by United Kingdom Reviewers

조가흔

2023-6-13