Loading Necessary Packages

library(tidyverse)
library(tidytext)
library(widyr)
library(tidyr)
library(tidygraph)
library(ggraph)
library(scales)
data(stop_words)
library('showtext')
font_add_google('Noto Sans KR', 'notosanskr')
showtext_auto()

disney_review <- read.csv("data/DisneylandReviews.csv")

1. Introduction

1) Executive Summary

Customer reviews are critical for a business. They are an essential factor in a customer’s decision-making process, so negative reviews can hurt a business’s reputation, customer loyalty, and sales. A business can also turn this around and treat negative reviews as feedback and an opportunity to resolve issues. This is where text analysis (TA) comes in handy. With TA, we can monitor the volume and sentiment of customer reviews and respond to problems more efficiently.

In this project, I analyze reviews of the Disneyland California and Paris branches, focusing on terms with negative sentiment according to the “Bing” sentiment lexicon. I chose “Bing” because it is binary (each word is labeled either positive or negative) and because its ratio of negative to positive words is higher than that of ‘nrc’, which makes negative words easier to detect.

First, I compare the negative term frequencies of the two branches (figure 1) to find out which negative terms were brought up most often. Next, I compare tf-idf values (figure 2) to identify negative terms that are characteristic of reviews of one specific branch. Lastly, building on this, I generate network graphs based on the phi coefficient to identify words that are relatively strongly correlated with the terms that have high tf-idf values.

2) Data Background & Summary

This public-domain dataset includes about 42,000 reviews of three Disneyland branches (Paris, California, and Hong Kong), posted by visitors on Trip Advisor.

There are 6 columns in this dataset which are:

Review_ID: unique id given to each review

Rating: ranging from 1 (unsatisfied) to 5 (satisfied)

Year_Month: when the reviewer visited the theme park

Reviewer_Location: country of origin of visitor

Review_Text: comments made by visitor

Branch: location of Disneyland Park

However, for this project I will only use the ‘Review_Text’ and ‘Branch’ columns, since the other columns are not needed for TA.

head(disney_review)
##   Review_ID Rating Year_Month    Reviewer_Location
## 1 670772142      4     2019-4            Australia
## 2 670682799      4     2019-5          Philippines
## 3 670623270      4     2019-4 United Arab Emirates
## 4 670607911      4     2019-4            Australia
## 5 670607296      4     2019-4       United Kingdom
## 6 670591897      3     2019-4            Singapore
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Review_Text
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you walk into main street! It has a very familiar feel. One of the rides  its a Small World  is absolutely fabulous and worth doing. The day we visited was fairly hot and relatively busy but the queues moved fairly well. 
## 2 Its been a while since d last time we visit HK Disneyland .. Yet, this time we only stay in Tomorrowland .. AKA Marvel land!Now they have Iron Man Experience n d Newly open Ant Man n d Wasp!!Ironman .. Great feature n so Exciting, especially d whole scenery of HK (HK central area to Kowloon)!Antman .. Changed by previous Buzz lightyear! More or less d same, but I'm expecting to have something most!!However, my boys like it!!Space Mountain .. Turns into Star Wars!! This 1 is Great!!!For cast members (staffs) .. Felt bit MINUS point from before!!! Just dun feel like its a Disney brand!! Seems more local like Ocean Park or even worst!!They got no SMILING face, but just wanna u to enter n attraction n leave!!Hello this is supposed to be Happiest Place on Earth brand!! But, just really Dont feel it!!Bakery in Main Street now have more attractive delicacies n Disney theme sweets .. These are Good Points!!Last, they also have Starbucks now inside the theme park!!
## 3                                 Thanks God it wasn   t too hot or too humid when I was visiting the park   otherwise it would be a big issue (there is not a lot of shade).I have arrived around 10:30am and left at 6pm. Unfortunately I didn   t last until evening parade, but 8.5 hours was too much for me.There is plenty to do and everyone will find something interesting for themselves to enjoy.It wasn   t extremely busy and the longest time I had to queue for certain attractions was 45 minutes (which is really not that bad).Although I had an amazing time, I felt a bit underwhelmed with choice of rides and attractions. The park itself is quite small (I was really expecting something grand   even the main castle which was closed by the way was quite small).The food options are good, few coffee shops (including Starbucks) and plenty of gift shops. There was no issue with toilets as they are everywhere.All together it was a great day out and I really enjoyed it.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      HK Disneyland is a great compact park. Unfortunately there is quite a bit of maintenance work going on at present so a number of areas are closed off (including the famous castle) If you go midweek, it is not too crowded and certainly no where near as bus as LA Disneyland. We did notice on this visit that prices for food, drinks etc have really gone through the roof so be prepared to pay top dollar for snacks (and avoid the souvenir shops if you can) Regardless, kids will love it.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        the location is not in the city, took around 1 hour from Kowlon, my kids like disneyland so much, everything is fine.   but its really crowded and hot in Hong Kong
## 6     Have been to Disney World, Disneyland Anaheim and Tokyo Disneyland but I feel that Disneyland Hong Kong is really too small to be called a Disneyland. It has way too few rides and attractions. Souvenirs, food and even entrance tickets are slightly more expensive than other Disneyland as well. Basically, this park is good only for small children and people who has never been to Disney. The food choices were acceptable, mostly fast food, and not too expensive. Bottled water, however, was VERY expensive but they do have water fountains around for you to refill your water bottles. The parade was pretty good. It was crowded not a problem but what was the problem was the people were just so rude, the pushing and shoving cutting in lines for the rides, gift shops, food stands was just to much to take. forget trying to see one of the shows its a free for all for seats, i don't see how Disney can let this happen, it was by far the worst managed Disney property.
##                Branch
## 1 Disneyland_HongKong
## 2 Disneyland_HongKong
## 3 Disneyland_HongKong
## 4 Disneyland_HongKong
## 5 Disneyland_HongKong
## 6 Disneyland_HongKong

I chose the Disneyland California and Paris branches for the comparison because they are the two branches with the most reviews.

disney_review %>%
  count(Branch, sort = T)
##                  Branch     n
## 1 Disneyland_California 19406
## 2      Disneyland_Paris 13630
## 3   Disneyland_HongKong  9620

3) Data Cleaning

To visualize the data efficiently, the following data cleaning steps are needed for all three figures:

- Extracting the ‘Review_Text’ and ‘Branch’ columns from the data (‘select()’)

- Tokenizing the ‘Review_Text’ column into a separate ‘word’ column (‘unnest_tokens()’)

- Removing stop words (‘anti_join()’)

- Adding a separate ‘sentiment’ column by joining each word to the ‘Bing’ lexicon on the ‘word’ key (‘inner_join()’)

- Extracting rows with ‘negative’ sentiment (‘filter()’)

- Extracting rows from ‘Disneyland_California’, and ‘Disneyland_Paris’ branch (‘filter()’)
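
The last step selects two branch names at once, which calls for a membership test. The following is a minimal base R sketch on made-up rows (the ‘toy’ data frame and the ‘keep’ vector are illustrative names, not part of the analysis) showing how ‘%in%’ and ‘==’ behave differently when the right-hand side holds two values:

```r
# Toy data: hypothetical rows, not taken from the dataset
toy <- data.frame(
  Branch = c("Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_Paris",
             "Disneyland_HongKong", "Disneyland_HongKong"),
  word = c("crowded", "broken", "cold", "crowded", "hot", "small"),
  stringsAsFactors = FALSE
)
keep <- c("Disneyland_California", "Disneyland_Paris")

# `==` recycles `keep` along the column: row 1 is compared with California,
# row 2 with Paris, row 3 with California again, and so on.
rows_eq <- toy[toy$Branch == keep, ]

# `%in%` asks, for every row, whether its Branch is one of the two names.
rows_in <- toy[toy$Branch %in% keep, ]

nrow(rows_eq)
nrow(rows_in)
```

In this toy case ‘==’ keeps only two of the four California and Paris rows, while ‘%in%’ keeps all four, so ‘%in%’ is the safer choice whenever more than one branch is selected.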

2. Individual Figures

1) Negative Term Frequency Comparison (Jitter Plot)

To compare term frequencies between the two branches, a jitter plot seemed suitable. Adding ‘geom_abline()’, which by default draws the line y = x, shows that words near the line are used with about equal frequency in reviews of both branches, while words far from the line are used much more in reviews of one branch than the other. The plot therefore conveys more about term frequency than a simple box plot.

After the initial data cleaning, I created an object named ‘negative_review_tf’, which counts how many times each word with negative sentiment appears in the reviews of each branch. This is needed to create another object, ‘cnp_frequency’, which holds the relative frequency of each word in the reviews of both branches (frequency = n / total). Here, ‘left_join()’ adds a column with the total number of words for each branch, and ‘pivot_wider()’ reshapes the data frame so that the two branches’ frequencies can be plotted on the x- and y-axes. Finally, I created the plot with ‘geom_jitter()’ and ‘geom_abline()’.
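
One way to put a number on "far from the line" is the log ratio of the two frequencies: on log10 axes, a word's vertical distance from y = x equals log10(Paris frequency / California frequency). The sketch below is only an illustration in base R; the frequencies are invented and the ‘log_ratio’ column is a hypothetical helper, not something the analysis computes.

```r
# Hypothetical relative frequencies for three words (illustrative values only)
freq <- data.frame(
  word = c("crowded", "cold", "bad"),
  Disneyland_California = c(0.10, 0.01, 0.05),
  Disneyland_Paris = c(0.01, 0.10, 0.05),
  stringsAsFactors = FALSE
)

# On log10 axes, the signed distance from y = x is the log ratio:
# negative -> more typical of California, positive -> more typical of Paris
freq$log_ratio <- log10(freq$Disneyland_Paris / freq$Disneyland_California)

# Words furthest from the line come first
ranked <- freq[order(-abs(freq$log_ratio)), ]
ranked
```

A word with a log ratio of 0 sits exactly on the line, while a log ratio of 1 or -1 means a tenfold difference in frequency between the branches.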

bing <- get_sentiments("bing")

review_tidy <- disney_review %>%
  select(Review_Text, Branch) %>%
  unnest_tokens(input = Review_Text,
                output = word,
                drop = T) %>%
  anti_join(stop_words)

frequency <- review_tidy %>%
  inner_join(bing) %>%
  count(Branch, word, sentiment, sort = T) %>%
  group_by(Branch)

negative_review_tf <- frequency %>%
  filter(sentiment == "negative") 
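cnp_frequency <- negative_review_tf %>%
  # %in% tests membership; == would recycle the two names and silently drop rows
  filter(Branch %in% c("Disneyland_California", "Disneyland_Paris")) %>%
  # total = number of negative-word occurrences per branch (sum of n),
  # so that freq = n / total is a proportion between 0 and 1
  left_join(negative_review_tf %>%
              group_by(Branch) %>%
              summarize(total = sum(n)),
            by = "Branch") %>%
  mutate(freq = n/total) %>%
  ungroup() %>%
  select(Branch, word, freq) %>%
  pivot_wider(names_from = Branch, values_from = freq) %>%
  arrange(Disneyland_California, Disneyland_Paris)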

head(cnp_frequency)
## # A tibble: 6 × 3
##   word           Disneyland_California Disneyland_Paris
##   <chr>                          <dbl>            <dbl>
## 1 aggressiveness              0.000555         0.000524
## 2 bores                       0.000555         0.000524
## 3 detracting                  0.000555         0.000524
## 4 distorted                   0.000555         0.000524
## 5 exclusion                   0.000555         0.000524
## 6 fidget                      0.000555         0.000524
cnp_tf_graph <- ggplot(cnp_frequency, aes(x = Disneyland_California, y = Disneyland_Paris)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "blue") +
  labs(title = "Fig 1. Negative Term Frequency Comparison Between California and Paris Branch Reviews") + 
  theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr')) 

cnp_tf_graph

2) Comparison of High Tf-Idf words with Negative Sentiments (Bar Plot)

In this section, I wanted to do a simple comparison, using a bar graph, of the top 10 negative-sentiment words with the highest tf-idf values in each branch’s reviews. Unlike the plain term-frequency comparison in figure 1, this gives more insight into negative terms that carry more weight in the reviews of one specific branch.

To plot figure 2, I used ‘bind_tf_idf()’, which automatically creates columns for the tf, idf, and tf-idf values of each word. For the visualization, I faceted the bar graph by branch and arranged the top 10 tf-idf words in descending order for easier reading.
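
As a rough check of what ‘bind_tf_idf()’ computes, the arithmetic can be reproduced by hand. The sketch below assumes the usual definitions (tf = n / total, idf = natural log of the number of documents divided by the number of documents containing the word) and uses the counts that ‘head(cnp_words)’ reports for ‘crowded’ in the California rows; the variable names are illustrative.

```r
# Counts as reported by head(cnp_words) for "crowded" (California)
n_word  <- 2596    # occurrences of "crowded"
n_total <- 19845   # total negative-word occurrences in the California rows
n_docs  <- 2       # two "documents": the California and Paris branches

tf <- n_word / n_total

# With two documents, a word is found either in one branch or in both
idf_one  <- log(n_docs / 1)   # word found in one branch only
idf_both <- log(n_docs / 2)   # word found in both branches

round(c(tf = tf, idf_one = idf_one, idf_both = idf_both,
        tf_idf = tf * idf_one), 4)
```

Because there are only two documents, idf can take just two values: log(2), about 0.693, for a word found in one branch, and 0 for a word found in both. Any word shared by both branches therefore gets a tf-idf of 0, however frequent it is.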

cnp_negative_review_tf <- negative_review_tf %>%
  # %in% tests membership; == would recycle the two names and silently drop rows
  filter(Branch %in% c("Disneyland_California", "Disneyland_Paris"))

total_words <- cnp_negative_review_tf %>%
  group_by(Branch) %>%
  summarize(total = sum(n))

cnp_words <- left_join(cnp_negative_review_tf, total_words)
  
head(cnp_words)
## # A tibble: 6 × 5
## # Groups:   Branch [2]
##   Branch                word    sentiment     n total
##   <chr>                 <chr>   <chr>     <int> <int>
## 1 Disneyland_California crowded negative   2596 19845
## 2 Disneyland_Paris      bad     negative   1245 22695
## 3 Disneyland_California bad     negative   1112 19845
## 4 Disneyland_California break   negative    954 19845
## 5 Disneyland_Paris      cold    negative    937 22695
## 6 Disneyland_Paris      poor    negative    764 22695
disney_tf_idf <- cnp_words %>%
  bind_tf_idf(word, Branch, n) 

head(disney_tf_idf)
## # A tibble: 6 × 8
## # Groups:   Branch [2]
##   Branch                word    sentiment     n total     tf   idf tf_idf
##   <chr>                 <chr>   <chr>     <int> <int>  <dbl> <dbl>  <dbl>
## 1 Disneyland_California crowded negative   2596 19845 0.131  0.693 0.0907
## 2 Disneyland_Paris      bad     negative   1245 22695 0.0549 0     0     
## 3 Disneyland_California bad     negative   1112 19845 0.0560 0     0     
## 4 Disneyland_California break   negative    954 19845 0.0481 0.693 0.0333
## 5 Disneyland_Paris      cold    negative    937 22695 0.0413 0.693 0.0286
## 6 Disneyland_Paris      poor    negative    764 22695 0.0337 0     0
disney_tf_idf %>%
  group_by(Branch) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, reorder_within(word, tf_idf, Branch), fill = Branch)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() + 
  facet_wrap(~Branch, ncol = 2, scales = "free") +
  labs(title = "Fig 2. Top 10 Negative Words from California and Paris Branch Reviews Using tf-idf",
         x = "tf-idf", y = NULL) + 
  theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))

3) Semantic Network Analysis Based on Phi Coefficients for Top 10 Negative Terms with Highest Tf-Idf Value (Network Graph)

In this part, the goal was a semantic network analysis: a network graph based on phi coefficients that shows which words are strongly related to the terms with high tf-idf values (from figure 2), in order to understand their context. I created separate tidy text data and network graphs for the California and Paris branches so that they are easier to compare.

To prepare the text data for the network graph, I used ‘pairwise_cor()’ to compute the phi coefficient. Then I defined a target object containing the vector of terms I wanted to analyze and used it to filter out unrelated word pairs. To keep the network from becoming too complex and to build it around highly relevant words, I kept only terms that appear at least 200 times in a branch’s reviews and filtered out word pairs with correlations lower than 0.09. Network centrality and community membership were also added as separate columns, since the plot needs them.
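
For intuition about the phi coefficient itself, here is a minimal base R sketch that computes it by hand from a 2 × 2 table of co-occurrences and compares it with the Pearson correlation of the same 0/1 indicators. The eight "reviews" and the two indicator vectors are invented for illustration; the sketch does not reproduce how ‘pairwise_cor()’ works internally.

```r
# Hypothetical indicators: does each of eight reviews contain the word?
has_limited <- c(1, 1, 1, 0, 0, 0, 0, 1)
has_options <- c(1, 1, 0, 0, 0, 0, 1, 1)

# Cells of the 2 x 2 contingency table
n11 <- sum(has_limited == 1 & has_options == 1)  # both words present
n10 <- sum(has_limited == 1 & has_options == 0)  # only "limited"
n01 <- sum(has_limited == 0 & has_options == 1)  # only "options"
n00 <- sum(has_limited == 0 & has_options == 0)  # neither word

# Phi coefficient from the table counts
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi
cor(has_limited, has_options)  # Pearson correlation of the 0/1 indicators
```

Both values agree, since the phi coefficient is the Pearson correlation applied to two binary variables. Values near 1 mean the two words tend to appear in the same reviews, and values near 0 mean they appear independently of each other.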

When plotting, I adjusted the color of the edges and nodes, and scaled their size by correlation and centrality, to make the graph easier to read.

cali_word_cors <- disney_review %>%
  filter(Branch == c("Disneyland_California")) %>%
  unnest_tokens(word, Review_Text, drop = T) %>%
  anti_join(stop_words) %>%
  add_count(word) %>%
  filter(n >= 200) %>%
  pairwise_cor(item = word, 
               feature = Review_ID, 
               sort = T) 
head(cali_word_cors)
## # A tibble: 6 × 3
##   item1    item2    correlation
##   <chr>    <chr>          <dbl>
## 1 jones    indiana        0.951
## 2 indiana  jones          0.951
## 3 springs  radiator       0.921
## 4 radiator springs        0.921
## 5 terror   tower          0.911
## 6 tower    terror         0.911
target1 <- c("crowded", "break", "broke", "ridiculous", "cheap", "sad", "broken", "scary", "nightmare", "dissapoint")

set.seed(1234)
cali_word_cors %>%
  filter(item1 %in% target1) %>%
  filter(correlation >= 0.09) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),
         group = as.factor(group_infomap())) %>%
  ggraph(layout = "fr") +

  geom_edge_link(color = "gray50",
                 aes(edge_alpha = correlation,   
                     edge_width = correlation),  
                 show.legend = F) +              
  scale_edge_width(range = c(1, 3)) +            

  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = F) +
  scale_size(range = c(5, 10)) +

  geom_node_text(aes(label = name),
                 repel = T,
                 size = 6) +

  theme_graph()+
  labs(title = "Fig 3-1. Network Graph Based on Phi Coefficients for Top 10 Negative Terms Mentioned in Reviews for Disneyland California Branch") + 
  theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))

paris_word_cors <- disney_review %>%
  filter(Branch == c("Disneyland_Paris")) %>%
  unnest_tokens(word, Review_Text, drop = T) %>%
  anti_join(stop_words) %>%
  add_count(word) %>%
  filter(n >= 200) %>%
  pairwise_cor(item = word, 
               feature = Review_ID, 
               sort = T) 
head(paris_word_cors)
## # A tibble: 6 × 3
##   item1   item2   correlation
##   <chr>   <chr>         <dbl>
## 1 jones   indiana       0.935
## 2 indiana jones         0.935
## 3 bay     newport       0.917
## 4 newport bay           0.917
## 5 phantom manor         0.879
## 6 manor   phantom       0.879
target2 <-  c("cold", "pan", "tired", "terror", "limited", "wrong", "issue", "stunt", "issues", "dark")

set.seed(1234)
paris_word_cors %>%
  filter(item1 %in% target2) %>%
  filter(correlation >= 0.09) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),
         group = as.factor(group_infomap())) %>%
  ggraph(layout = "fr") +

  geom_edge_link(color = "gray50",
                 aes(edge_alpha = correlation,   
                     edge_width = correlation),  
                 show.legend = F) +              
  scale_edge_width(range = c(1, 3)) +            

  geom_node_point(aes(size = centrality,
                      color = group),
                  show.legend = F) +
  scale_size(range = c(5, 10)) +

  geom_node_text(aes(label = name),
                 repel = T,
                 size = 6) +

  theme_graph()+
  labs(title = "Fig 3-2. Network Graph Based on Phi Coefficients for Top 10 Negative Terms Mentioned in Reviews for Disneyland Paris Branch") + 
  theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))

3. Total Summary & Results

First, in figure 1, I compared term frequencies for the Disneyland California and Paris branches with a jitter plot. In the graph, words such as ‘peril’, ‘smoke’, ‘mania’, and ‘hollow’ lie far from the line y = x, which serves as the reference line. This indicates that these words are used much more frequently in reviews of one branch than the other.

In figure 2, I compared the top 10 words with the highest tf-idf values in each branch, which provides context specific to each branch. For example, ‘crowded’ and ‘break’ are the two words with the highest tf-idf for the California branch. From this, we can infer that visitors complained about the park being too crowded and about things breaking, and that these complaints are characteristic of the California branch.

Lastly, in figure 3, I created a network graph for each branch, based on phi coefficients, for the terms from figure 2. These graphs give more detailed insight into the negative terms and why they were used. For instance, one of the Paris terms with a high tf-idf value was ‘limited’. From that term alone, we cannot tell what exactly visitors were dissatisfied with. The network graph, however, shows that ‘limited’ is closely related to ‘options’, ‘choice’, and ‘food’, which suggests that the Disneyland Paris branch probably offered limited food options.

Customer feedback plays a major role in business development. I therefore think that, for a business to grow, actively using TA in customer interaction, as in this project, is essential. With negative reviews, a business can come up with solutions and compensation quickly and effectively; with positive reviews, it can understand its strengths and reinforce them.