library(tidyverse)
library(tidytext)
library(widyr)
library(tidyr)
library(tidygraph)
library(ggraph)
library(scales)
data(stop_words)
library('showtext')
font_add_google('Noto Sans KR', 'notosanskr')
showtext_auto()
disney_review <- read.csv("data/DisneylandReviews.csv")
Customer reviews are critical for a business. They are an essential factor in a customer’s decision-making process, so negative reviews can hurt a business’s reputation, customer loyalty, and sales. A business can also turn this around and treat negative reviews as feedback and an opportunity to resolve issues. This is where text analysis (TA) comes in handy. With TA, we can monitor the volume and sentiment of customer reviews and respond with solutions more efficiently.
In this project, I analyze reviews of the Disneyland California and Paris branches, focusing on terms with negative sentiment according to the “Bing” sentiment lexicon. I chose “Bing” because it is binary (each word is labeled either positive or negative) and because its ratio of negative to positive words is higher than that of ‘nrc’, which makes negative words easier to detect.
First, I compare the negative term frequencies of the two branches (figure 1) to find out which negative terms come up often. Next, I compare tf-idf values (figure 2) to identify negative terms that are characteristic of one branch’s reviews. Lastly, building on this, I generate network graphs using the phi coefficient (figure 3) to identify words that are strongly associated with the terms that have high tf-idf values.
This public-domain dataset includes roughly 42,000 reviews of three Disneyland branches (Paris, California, and Hong Kong) posted by visitors on TripAdvisor.
The dataset has 6 columns:
Review_ID: unique id given to each review
Rating: ranging from 1 (unsatisfied) to 5 (satisfied)
Year_Month: when the reviewer visited the theme park
Reviewer_Location: country of origin of visitor
Review_Text: comments made by the visitor
Branch: location of the Disneyland park
However, for this project I will mainly use the ‘Review_Text’ and ‘Branch’ columns (plus ‘Review_ID’ to tell individual reviews apart in the network analysis), since the other columns aren’t relevant to this analysis.
head(disney_review)
## Review_ID Rating Year_Month Reviewer_Location
## 1 670772142 4 2019-4 Australia
## 2 670682799 4 2019-5 Philippines
## 3 670623270 4 2019-4 United Arab Emirates
## 4 670607911 4 2019-4 Australia
## 5 670607296 4 2019-4 United Kingdom
## 6 670591897 3 2019-4 Singapore
## Review_Text
## 1 If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you walk into main street! It has a very familiar feel. One of the rides its a Small World is absolutely fabulous and worth doing. The day we visited was fairly hot and relatively busy but the queues moved fairly well.
## 2 Its been a while since d last time we visit HK Disneyland .. Yet, this time we only stay in Tomorrowland .. AKA Marvel land!Now they have Iron Man Experience n d Newly open Ant Man n d Wasp!!Ironman .. Great feature n so Exciting, especially d whole scenery of HK (HK central area to Kowloon)!Antman .. Changed by previous Buzz lightyear! More or less d same, but I'm expecting to have something most!!However, my boys like it!!Space Mountain .. Turns into Star Wars!! This 1 is Great!!!For cast members (staffs) .. Felt bit MINUS point from before!!! Just dun feel like its a Disney brand!! Seems more local like Ocean Park or even worst!!They got no SMILING face, but just wanna u to enter n attraction n leave!!Hello this is supposed to be Happiest Place on Earth brand!! But, just really Dont feel it!!Bakery in Main Street now have more attractive delicacies n Disney theme sweets .. These are Good Points!!Last, they also have Starbucks now inside the theme park!!
## 3 Thanks God it wasn t too hot or too humid when I was visiting the park otherwise it would be a big issue (there is not a lot of shade).I have arrived around 10:30am and left at 6pm. Unfortunately I didn t last until evening parade, but 8.5 hours was too much for me.There is plenty to do and everyone will find something interesting for themselves to enjoy.It wasn t extremely busy and the longest time I had to queue for certain attractions was 45 minutes (which is really not that bad).Although I had an amazing time, I felt a bit underwhelmed with choice of rides and attractions. The park itself is quite small (I was really expecting something grand even the main castle which was closed by the way was quite small).The food options are good, few coffee shops (including Starbucks) and plenty of gift shops. There was no issue with toilets as they are everywhere.All together it was a great day out and I really enjoyed it.
## 4 HK Disneyland is a great compact park. Unfortunately there is quite a bit of maintenance work going on at present so a number of areas are closed off (including the famous castle) If you go midweek, it is not too crowded and certainly no where near as bus as LA Disneyland. We did notice on this visit that prices for food, drinks etc have really gone through the roof so be prepared to pay top dollar for snacks (and avoid the souvenir shops if you can) Regardless, kids will love it.
## 5 the location is not in the city, took around 1 hour from Kowlon, my kids like disneyland so much, everything is fine. but its really crowded and hot in Hong Kong
## 6 Have been to Disney World, Disneyland Anaheim and Tokyo Disneyland but I feel that Disneyland Hong Kong is really too small to be called a Disneyland. It has way too few rides and attractions. Souvenirs, food and even entrance tickets are slightly more expensive than other Disneyland as well. Basically, this park is good only for small children and people who has never been to Disney. The food choices were acceptable, mostly fast food, and not too expensive. Bottled water, however, was VERY expensive but they do have water fountains around for you to refill your water bottles. The parade was pretty good. It was crowded not a problem but what was the problem was the people were just so rude, the pushing and shoving cutting in lines for the rides, gift shops, food stands was just to much to take. forget trying to see one of the shows its a free for all for seats, i don't see how Disney can let this happen, it was by far the worst managed Disney property.
## Branch
## 1 Disneyland_HongKong
## 2 Disneyland_HongKong
## 3 Disneyland_HongKong
## 4 Disneyland_HongKong
## 5 Disneyland_HongKong
## 6 Disneyland_HongKong
I chose the California and Paris branches for comparison because they have the two highest review counts.
disney_review %>%
count(Branch, sort = T)
## Branch n
## 1 Disneyland_California 19406
## 2 Disneyland_Paris 13630
## 3 Disneyland_HongKong 9620
To visualize the data efficiently, the following data cleaning steps are needed before plotting:
- Extracting the ‘Review_Text’ and ‘Branch’ columns from the data (‘select()’)
- Tokenizing the ‘Review_Text’ column into a separate ‘word’ column (‘unnest_tokens()’)
- Removing stop words (‘anti_join()’)
- Adding a ‘sentiment’ column by joining on the ‘word’ key with the ‘Bing’ lexicon (‘inner_join()’)
- Keeping rows with ‘negative’ sentiment (‘filter()’)
- Keeping rows from the ‘Disneyland_California’ and ‘Disneyland_Paris’ branches (‘filter()’)
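The last step keeps two branches at once, and the choice of comparison operator matters there. The following is a minimal sketch on a made-up four-row tibble (the toy data and the names `toy`, `keep`, `n_equal`, and `n_in` are illustrative assumptions, not part of the project) showing how `==` and `%in%` behave differently when the right-hand side has more than one value.

```r
library(dplyr)

# Hypothetical toy data: four rows alternating between the two branches
toy <- tibble(
  Branch = c("Disneyland_Paris", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_California"),
  word   = c("cold", "crowded", "poor", "break")
)

keep <- c("Disneyland_California", "Disneyland_Paris")

# `==` recycles `keep` along the rows: row 1 is compared with California,
# row 2 with Paris, row 3 with California again, and so on
n_equal <- toy %>% filter(Branch == keep) %>% nrow()

# `%in%` checks every row for membership in `keep`
n_in <- toy %>% filter(Branch %in% keep) %>% nrow()
```

In this toy case the `==` version keeps no rows at all, while `%in%` keeps all four, so `%in%` is the safer choice whenever more than one branch is selected.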
To compare the term frequencies of the two branches, I used a jitter plot. Adding ‘geom_abline()’, which draws the line y = x, shows that words near the line are used about equally often in both branches’ reviews, while words far from the line are used much more in one branch’s reviews than in the other’s. This gives more information about term frequency than a simple box plot.
After the initial data cleaning, I created an object named ‘negative_review_tf’, which counts how many times each negative word appears in each branch’s reviews. It is needed to build a second object, ‘cnp_frequency’, which holds each word’s frequency within each branch (frequency = n / total). Here, ‘left_join()’ adds a column with the total number of negative words per branch, and ‘pivot_wider()’ reshapes the data frame so that each branch’s frequencies sit in their own column, ready to be plotted on the x- and y-axes. Finally, I drew the plot with ‘geom_jitter()’ and ‘geom_abline()’.
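The frequency calculation and reshaping described above can be sketched on a small made-up table of counts. The branch labels, words, and counts below are hypothetical, and the sketch computes the per-branch total with a grouped `mutate()` instead of a join, which is an alternative to the approach used in this project rather than a copy of it.

```r
library(dplyr)
library(tidyr)

# Hypothetical word counts per branch (not taken from the dataset)
toy_counts <- tibble(
  Branch = c("California", "California", "Paris", "Paris"),
  word   = c("bad", "cold", "bad", "cold"),
  n      = c(30, 10, 20, 20)
)

wide <- toy_counts %>%
  group_by(Branch) %>%
  mutate(freq = n / sum(n)) %>%   # share of each word within its branch
  ungroup() %>%
  select(Branch, word, freq) %>%
  # one row per word, one frequency column per branch
  pivot_wider(names_from = Branch, values_from = freq)
```

Each row of `wide` is one word with a frequency column per branch, which is the shape needed to put one branch on the x-axis and the other on the y-axis.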
bing <- get_sentiments("bing")
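# Keep the two relevant columns, split reviews into one word per row, drop stop words
review_tidy <- disney_review %>%
  select(Review_Text, Branch) %>%
  unnest_tokens(input = Review_Text,
                output = word,
                drop = TRUE) %>%
  anti_join(stop_words, by = "word")
# Attach Bing sentiment labels and count each word per branch
frequency <- review_tidy %>%
  inner_join(bing, by = "word") %>%
  count(Branch, word, sentiment, sort = TRUE) %>%
  group_by(Branch)
# Keep only the words labeled negative
negative_review_tf <- frequency %>%
  filter(sentiment == "negative")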
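cnp_frequency <- negative_review_tf %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(Branch %in% c("Disneyland_California", "Disneyland_Paris")) %>%
  # total = number of negative word occurrences per branch (sum of n)
  left_join(negative_review_tf %>%
              count(Branch, wt = n, name = "total"),
            by = "Branch") %>%
  mutate(freq = n / total) %>%
  ungroup() %>%
  select(Branch, word, freq) %>%
  pivot_wider(names_from = Branch, values_from = freq) %>%
  arrange(Disneyland_California, Disneyland_Paris)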
head(cnp_frequency)
## # A tibble: 6 × 3
## word Disneyland_California Disneyland_Paris
## <chr> <dbl> <dbl>
## 1 aggressiveness 0.000555 0.000524
## 2 bores 0.000555 0.000524
## 3 detracting 0.000555 0.000524
## 4 distorted 0.000555 0.000524
## 5 exclusion 0.000555 0.000524
## 6 fidget 0.000555 0.000524
cnp_tf_graph <- ggplot(cnp_frequency, aes(x = Disneyland_California, y = Disneyland_Paris)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "blue") +
labs(title = "Fig 1. Negative Term Frequency Comparison Between California and Paris Branch Reviews") +
theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))
cnp_tf_graph
In this section, I make a simple comparison, using a bar graph, of the 10 negative words with the highest tf-idf values in each branch’s reviews. Unlike the plain term-frequency comparison in figure 1, this gives more insight into negative terms that carry more weight in one branch’s reviews.
Plotting figure 2 requires ‘bind_tf_idf()’, which automatically creates columns for the tf, idf, and tf-idf values of each word. For the visualization, I faceted the bar graph by branch and arranged the top 10 tf-idf words in descending order for easier reading.
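To make the tf-idf columns more concrete, here is a minimal sketch on made-up counts (the branch labels, words, and numbers are hypothetical). It treats each branch as one document, as this project does. With only two documents, the idf of a word is ln(2/1) ≈ 0.693 if the word occurs in one branch and ln(2/2) = 0 if it occurs in both, so a word shared by both branches always ends up with a tf-idf of 0.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts of negative words per branch (not from the dataset)
toy_counts <- tibble(
  Branch = c("California", "California", "Paris", "Paris"),
  word   = c("crowded", "bad", "cold", "bad"),
  n      = c(3, 1, 2, 2)
)

# Each branch is treated as a single document
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, Branch, n)

# "bad" occurs in both branches, so its idf and tf-idf are 0;
# "crowded" occurs only in California, so its idf is log(2)
toy_tf_idf
```

This is why the idf column in the output of this project takes only the values 0 and 0.693.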
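cnp_negative_review_tf <- negative_review_tf %>%
  # %in% keeps every row from either branch; == would recycle the vector
  filter(Branch %in% c("Disneyland_California", "Disneyland_Paris"))
# Total number of negative word occurrences per branch
total_words <- cnp_negative_review_tf %>%
  group_by(Branch) %>%
  summarize(total = sum(n))
cnp_words <- left_join(cnp_negative_review_tf, total_words, by = "Branch")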
head(cnp_words)
## # A tibble: 6 × 5
## # Groups: Branch [2]
## Branch word sentiment n total
## <chr> <chr> <chr> <int> <int>
## 1 Disneyland_California crowded negative 2596 19845
## 2 Disneyland_Paris bad negative 1245 22695
## 3 Disneyland_California bad negative 1112 19845
## 4 Disneyland_California break negative 954 19845
## 5 Disneyland_Paris cold negative 937 22695
## 6 Disneyland_Paris poor negative 764 22695
disney_tf_idf <- cnp_words %>%
bind_tf_idf(word, Branch, n)
head(disney_tf_idf)
## # A tibble: 6 × 8
## # Groups: Branch [2]
## Branch word sentiment n total tf idf tf_idf
## <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Disneyland_California crowded negative 2596 19845 0.131 0.693 0.0907
## 2 Disneyland_Paris bad negative 1245 22695 0.0549 0 0
## 3 Disneyland_California bad negative 1112 19845 0.0560 0 0
## 4 Disneyland_California break negative 954 19845 0.0481 0.693 0.0333
## 5 Disneyland_Paris cold negative 937 22695 0.0413 0.693 0.0286
## 6 Disneyland_Paris poor negative 764 22695 0.0337 0 0
disney_tf_idf %>%
group_by(Branch) %>%
slice_max(tf_idf, n = 10) %>%
ungroup() %>%
ggplot(aes(tf_idf, reorder_within(word, tf_idf, Branch), fill = Branch)) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
facet_wrap(~Branch, ncol = 2, scales = "free") +
labs(title = "Fig 2. Top 10 Negative Words from California and Paris Branch Reviews Using tf-idf",
x = "tf-idf", y = NULL) +
theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))
In this part, the goal is a semantic network analysis: a network graph built from phi coefficients shows which words are strongly associated with the high tf-idf terms from figure 2 and helps explain their context. I created separate tidy text data and a separate network graph for each of the California and Paris branches to make them easier to compare.
To prepare the text data for the network graph, I used ‘pairwise_cor()’ to compute the phi coefficient between word pairs. I then defined a target vector containing the terms I want to analyze and kept only the pairs involving those terms. To keep the network from getting too complex and to build it around strongly associated words, I kept terms that appear at least 200 times in a branch’s reviews and filtered out word pairs with correlations below 0.09. The plot also needs network centrality and community membership, which are added as separate columns.
When plotting, I mapped edge transparency and width to correlation, node size to centrality, and node color to community to make the graph easier to read.
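As a rough illustration of what the phi coefficient measures, the sketch below computes it by hand for two words across six made-up reviews and checks it against the Pearson correlation of the same presence/absence indicators. The review data are hypothetical, and the sketch uses only base R rather than ‘pairwise_cor()’, so it shows the underlying quantity and not the package’s internals.

```r
# Hypothetical presence/absence of two words across six reviews (1 = word appears)
limited <- c(1, 1, 1, 0, 0, 0)
food    <- c(1, 1, 0, 1, 0, 0)

# Counts for the 2x2 contingency table
n11 <- sum(limited == 1 & food == 1)  # reviews containing both words
n10 <- sum(limited == 1 & food == 0)  # only "limited"
n01 <- sum(limited == 0 & food == 1)  # only "food"
n00 <- sum(limited == 0 & food == 0)  # neither word

# Phi coefficient from the contingency counts
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# For binary indicators, phi equals the Pearson correlation
phi_cor <- cor(limited, food)
```

A value near 1 means the two words tend to appear in the same reviews, a value near 0 means their appearances are unrelated, and a negative value means one word tends to appear when the other does not.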
cali_word_cors <- disney_review %>%
filter(Branch == c("Disneyland_California")) %>%
unnest_tokens(word, Review_Text, drop = T) %>%
anti_join(stop_words) %>%
add_count(word) %>%
filter(n >= 200) %>%
pairwise_cor(item = word,
feature = Review_ID,
sort = T)
head(cali_word_cors)
## # A tibble: 6 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 jones indiana 0.951
## 2 indiana jones 0.951
## 3 springs radiator 0.921
## 4 radiator springs 0.921
## 5 terror tower 0.911
## 6 tower terror 0.911
target1 <- c("crowded", "break", "broke", "ridiculous", "cheap", "sad", "broken", "scary", "nightmare", "dissapoint")
set.seed(1234)
cali_word_cors %>%
filter(item1 %in% target1) %>%
filter(correlation >= 0.09) %>%
as_tbl_graph(directed = F) %>%
mutate(centrality = centrality_degree(),
group = as.factor(group_infomap())) %>%
ggraph(layout = "fr") +
geom_edge_link(color = "gray50",
aes(edge_alpha = correlation,
edge_width = correlation),
show.legend = F) +
scale_edge_width(range = c(1, 3)) +
geom_node_point(aes(size = centrality,
color = group),
show.legend = F) +
scale_size(range = c(5, 10)) +
geom_node_text(aes(label = name),
repel = T,
size = 6) +
theme_graph()+
labs(title = "Fig 3-1. Network Graph Based on Phi Coefficients for Top 10 Negative Terms Mentioned in Reviews for Disneyland California Branch") +
theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))
paris_word_cors <- disney_review %>%
filter(Branch == c("Disneyland_Paris")) %>%
unnest_tokens(word, Review_Text, drop = T) %>%
anti_join(stop_words) %>%
add_count(word) %>%
filter(n >= 200) %>%
pairwise_cor(item = word,
feature = Review_ID,
sort = T)
head(paris_word_cors)
## # A tibble: 6 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 jones indiana 0.935
## 2 indiana jones 0.935
## 3 bay newport 0.917
## 4 newport bay 0.917
## 5 phantom manor 0.879
## 6 manor phantom 0.879
target2 <- c("cold", "pan", "tired", "terror", "limited", "wrong", "issue", "stunt", "issues", "dark")
set.seed(1234)
paris_word_cors %>%
filter(item1 %in% target2) %>%
filter(correlation >= 0.09) %>%
as_tbl_graph(directed = F) %>%
mutate(centrality = centrality_degree(),
group = as.factor(group_infomap())) %>%
ggraph(layout = "fr") +
geom_edge_link(color = "gray50",
aes(edge_alpha = correlation,
edge_width = correlation),
show.legend = F) +
scale_edge_width(range = c(1, 3)) +
geom_node_point(aes(size = centrality,
color = group),
show.legend = F) +
scale_size(range = c(5, 10)) +
geom_node_text(aes(label = name),
repel = T,
size = 6) +
theme_graph()+
labs(title = "Fig 3-2. Network Graph Based on Phi Coefficients for Top 10 Negative Terms Mentioned in Reviews for Disneyland Paris Branch") +
theme(plot.title = element_text(size = 15), text = element_text(family = 'notosanskr'))
First, in figure 1, I compared term frequencies for the California and Paris branches with a jitter plot. In the graph, words such as ‘peril’, ‘smoke’, ‘mania’, and ‘hollow’ lie far from the y = x reference line. This indicates that these words are used much more frequently in one branch’s reviews than in the other’s.
In figure 2, I compared the top 10 words by tf-idf for each branch, which provides context specific to each branch. For example, ‘crowded’ and ‘break’ are the two words with the highest tf-idf for the California branch. This suggests that visitors complained about crowding and about things breaking, and that these complaints are characteristic of the California branch.
Lastly, in figure 3, I created a network graph for each branch based on phi coefficients for the terms from figure 2. These graphs give more detailed insight into the negative terms and why they were used. For instance, one of the high tf-idf terms for the Paris branch is ‘limited’. On its own, that does not tell us what visitors were dissatisfied with. The network graph, however, shows that ‘limited’ is closely related to ‘options’, ‘choice’, and ‘food’, which suggests that Disneyland Paris offered a limited choice of food.
Customer feedback plays a major role in business development. I therefore think that, for a business to grow, active use of TA in customer interaction, as in this project, is essential. With negative reviews, a business can come up with solutions and compensation quickly and effectively, and with positive reviews, it can understand its strengths and reinforce them.