My project uses the Disneyland Reviews dataset from Kaggle, which contains more than 42,000 TripAdvisor reviews of three Disneyland parks: California, Paris, and Hong Kong.

library(readr)
Disney <- read_csv("~/Desktop/MEA/DisneylandReviews.csv")
View(Disney)

library(dplyr)
library(ggplot2)
library(tidyverse)
library(tidytext)

It is meaningful to look at this dataset by park location. Therefore, I split the data by branch and created three new datasets: CA (Disneyland in Anaheim, CA), Paris (Disneyland Paris), and HK (Hong Kong Disneyland). I also performed some basic exploratory analysis to get a general idea of the dataset, including the number of reviews and the average rating for each park.

California

CA <- filter(Disney,Branch == "Disneyland_California")
View(CA)
mean(CA$Rating)
## [1] 4.405339
count(CA)
## # A tibble: 1 × 1
##       n
##   <int>
## 1 19406

There are 19,406 reviews of Disneyland in Anaheim, CA, with an average rating of about 4.41.

Paris

Paris <- filter(Disney,Branch == "Disneyland_Paris")
View(Paris)
mean(Paris$Rating)
## [1] 3.960088
count(Paris)
## # A tibble: 1 × 1
##       n
##   <int>
## 1 13630

Disneyland Paris has 13,630 reviews, with an average rating of about 3.96.

Hong Kong

HK <- filter(Disney,Branch == "Disneyland_HongKong")
View(HK)
mean(HK$Rating)
## [1] 4.204158
count(HK)
## # A tibble: 1 × 1
##       n
##   <int>
## 1  9620

There are 9,620 reviews of Hong Kong Disneyland, with an average rating of about 4.20.

In Summary

The average rating for each park (out of 5): 4.41 for Anaheim, CA, 3.96 for Paris, and 4.20 for Hong Kong.
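The three filter-and-summarize steps above can also be written as a single grouped summary. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of `Disney` (the ratings are illustrative, not taken from the dataset):

```r
library(dplyr)

# Hypothetical stand-in for the Disney data: one row per review
toy <- tibble(
  Branch = c("Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_HongKong"),
  Rating = c(5, 4, 3, 4)
)

# Number of reviews and average rating per park in one pass
park_summary <- toy %>%
  group_by(Branch) %>%
  summarize(reviews = n(), avg_rating = mean(Rating))

print(park_summary)
```

Applied to the real data, the same pipeline would start from `Disney` instead of `toy`.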

Disney %>%
  separate(Year_Month, c("Year","Month"),sep = '-', remove=TRUE) -> Disney2
View(Disney2)

I used the separate() function in R to separate the variable Year_Month into two new variables Year and Month.
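As a small illustration of what separate() does, the sketch below splits two made-up Year_Month values. It also uses the optional `convert = TRUE` argument, which the code above does not use, so that the new columns come out as integers rather than character strings:

```r
library(tidyr)

# Made-up values in a "year-month" layout; not rows from the dataset
toy_dates <- data.frame(Year_Month = c("2019-4", "2018-12"))

# Split on the hyphen and let separate() convert the pieces to integers
toy_split <- separate(toy_dates, Year_Month, c("Year", "Month"),
                      sep = "-", convert = TRUE)

str(toy_split)
```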

Disney2 %>%
  group_by(Year, Branch) %>%
  summarize(average = mean(Rating)) -> avgrating
View(avgrating)

I used the summarize() function to find the average rating by year for each of the three park locations, and then plotted a graph showing how the average rating of each park changed over time.
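When averaging by year, it can help to keep the number of reviews next to each average, since an average over a handful of reviews is less reliable. A minimal sketch on made-up data (the `reviews` column and the toy values are assumptions for illustration):

```r
library(dplyr)

# Made-up reviews; not rows from the dataset
toy_years <- tibble(
  Year   = c("2018", "2018", "2019", "2019", "2019"),
  Branch = c("Disneyland_Paris", "Disneyland_Paris", "Disneyland_Paris",
             "Disneyland_HongKong", "Disneyland_HongKong"),
  Rating = c(4, 2, 5, 3, 5)
)

# Average rating and review count for every year/park combination
yearly <- toy_years %>%
  group_by(Year, Branch) %>%
  summarize(average = mean(Rating), reviews = n(), .groups = "drop")

print(yearly)
```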

# Convert Year to numeric first so that any non-numeric entry becomes NA, then drop those rows
as.numeric(avgrating$Year) -> avgrating$Year
na.omit(avgrating) -> avgrating
View(avgrating)

Many reviews in this dataset do not include the year and month. I used the na.omit() function to exclude the entries with missing years and plotted the average rating for each park from 2010 to 2019. Because reviews without a date are left out, the graph reflects only the dated subset of the dataset.
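One way to make the handling of undated entries explicit is to convert the year to a number, count how many entries fail to convert, and then drop them. This is a sketch on made-up values. The string "missing" is only an assumed placeholder for an undated entry and may not match how the dataset encodes it:

```r
# Made-up yearly averages; "missing" is an assumed placeholder for an undated entry
toy_avg <- data.frame(
  Year = c("2015", "2016", "missing"),
  average = c(4.2, 4.1, 3.9),
  stringsAsFactors = FALSE
)

# Non-numeric years become NA; as.numeric() warns about this, so the warning is silenced here
toy_avg$Year <- suppressWarnings(as.numeric(toy_avg$Year))

# Record how many rows will be lost before removing them
dropped <- sum(is.na(toy_avg$Year))
toy_avg <- na.omit(toy_avg)

print(dropped)
print(toy_avg)
```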

avgrating %>%
  mutate(Branch = recode(Branch, 'Disneyland_California'='California','Disneyland_HongKong' = 'Hong Kong', 'Disneyland_Paris' = "Paris")) -> avgrating
ggplot(avgrating,aes(x=Year, y=average, color = Branch)) + 
  geom_line() +
  labs(title="Average Rating Over Time for Each Park",
       x="Year",
       y="Average Rating",
       color="Location") +
  scale_x_continuous(breaks = seq(2010, 2019))

The graph shows a clear difference in the average rating over time across the three parks. The average rating for Disneyland in California remained the highest of all parks, but it decreased from 2010 to 2018. This decline might reflect that more visitors were dissatisfied with their experience at Disneyland in California during that period. The average ratings for Hong Kong Disneyland and Disneyland Paris rose between 2010 and 2017, but both dropped starting in 2017. Further investigation is required to determine the cause of these drops.

Visitors’ Country of Origin

To investigate visitors’ country of origin for each Disneyland, I plotted a column graph of the top 10 reviewer locations for each park. The mix of visitor nationalities depends on where the park is located.

California

CA %>%
  count(Reviewer_Location) %>%
  arrange(desc(n)) %>%
  head(10) ->top10CA
View(top10CA)

library(ggthemes)
ggplot(top10CA,aes(x=n,y=reorder(Reviewer_Location,n), fill=Reviewer_Location)) + geom_col() +
  labs(title="Top 10 Countries of Origin for Reviewers (California)",
       x="Count",
       y="Reviewer Location",
       fill="Location") +
  scale_x_continuous(breaks = seq(0, 12000,2000))+
  scale_fill_hc()

Most of California’s visitors are from the United States.

Paris

Paris %>%
  count(Reviewer_Location) %>%
  arrange(desc(n)) %>%
  head(10) ->top10Paris
View(top10Paris)

ggplot(top10Paris,aes(x=n,y=reorder(Reviewer_Location,n), fill=Reviewer_Location)) + geom_col() +
  labs(title="Top 10 Countries of Origin for Reviewers (Paris)",
       x="Count",
       y="Reviewer Location",
       fill="Location") +
  scale_x_continuous(breaks = seq(0, 12000,2000))+
  scale_fill_hc()

Most of Paris’s visitors are from the United Kingdom, and six out of the top ten countries are in Europe.

Hong Kong

HK %>%
  count(Reviewer_Location) %>%
  arrange(desc(n)) %>%
  head(10) ->top10HK
View(top10HK)

ggplot(top10HK,aes(x=n,y=reorder(Reviewer_Location,n), fill=Reviewer_Location)) + geom_col() +
  labs(title="Top 10 Countries of Origin for Reviewers (Hong Kong)",
       x="Count",
       y="Reviewer Location",
       fill="Location") +
  scale_x_continuous(breaks = seq(0, 12000,2000))+
  scale_fill_hc()

Many reviewers of Hong Kong Disneyland are from countries in the Asia-Pacific region, such as Australia, India, and the Philippines.
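The three per-park pipelines above follow the same pattern, so the top locations for all parks can also be found in one grouped step. A minimal sketch on made-up data, keeping only the single most common location per park (for the real data, `n = 10` would give the top 10):

```r
library(dplyr)

# Made-up reviewer locations; not rows from the dataset
toy_loc <- tibble(
  Branch = c(rep("Disneyland_Paris", 4), rep("Disneyland_HongKong", 3)),
  Reviewer_Location = c("United Kingdom", "United Kingdom", "France", "Spain",
                        "Australia", "Australia", "India")
)

# Count locations within each park, then keep the most frequent one per park
top_locations <- toy_loc %>%
  count(Branch, Reviewer_Location) %>%
  group_by(Branch) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_locations)
```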

Family Words

family_words <- c("family", "kid","kids", "children","child","mother", "father", "parent", "parents", "son", "sons","daughter","daughters", "mom", "dad","brother", "brothers","sister","sisters","siblings","siblings","cousin","cousins","wife","husband","grandmother","graondfather","grandparet","grandparents","grandson","grandsons","granddaughter", "granddaughters","grandma","grandpa","girl","grils","boy","boys","age","aged")

Disneyland is known as a family vacation destination, and it is high on the bucket list for many families. I’m interested in how often reviewers mention visiting Disneyland with family members and describe their experience as a family. I put together a character vector of family-related words, called family_words, and used it to count the family words in visitors’ reviews for each park.

Number of Family Words for Each Park

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  filter(word %in% family_words) %>% 
  count(Branch, sort = TRUE)
## # A tibble: 3 × 2
##   Branch                    n
##   <chr>                 <int>
## 1 Disneyland_Paris      19805
## 2 Disneyland_California 19492
## 3 Disneyland_HongKong   10038

Family words appeared 19,805 times in reviews of Disneyland Paris, 19,492 times in reviews of Disneyland in California, and 10,038 times in reviews of Hong Kong Disneyland. Relative to the number of reviews for each park (13,630 for Paris, 19,406 for California, and 9,620 for Hong Kong), this is about 1.45 family words per review for Paris, compared with about 1.00 for California and 1.04 for Hong Kong. Family words were therefore used more often in the Disneyland Paris reviews than in the reviews of the other two parks.
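The comparison relative to the number of reviews can be made explicit by dividing each family-word count by that park's review count. The sketch below uses only the totals reported above; the vector names are chosen here for illustration:

```r
# Family-word counts and review counts per park, as reported above
family_counts <- c(Paris = 19805, California = 19492, HongKong = 10038)
review_counts <- c(Paris = 13630, California = 19406, HongKong = 9620)

# Average number of family words per review, rounded to two decimals
family_per_review <- round(family_counts / review_counts, 2)

print(family_per_review)
```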

Wait Words

wait_words <- c('hour','hours','minutes','long', 'day','days','wait','waiting','line','lines','pass','busy','crowded','stand')

Reviews often include comments or complaints about the wait time for attractions and rides at Disneyland. Therefore, I created a character vector of words related to waiting in line, called wait_words. I’m interested in how many times these words appear in visitors’ reviews for each park.

Number of Wait Words for Each Park

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  filter(word %in% wait_words) %>% 
  count(Branch, sort = TRUE)
## # A tibble: 3 × 2
##   Branch                    n
##   <chr>                 <int>
## 1 Disneyland_California 54722
## 2 Disneyland_Paris      39159
## 3 Disneyland_HongKong   17763

Visitors used wait words 54,722 times in reviews of Disneyland in California, compared with 39,159 times for Disneyland Paris and 17,763 times for Hong Kong Disneyland. It is important to consider the difference in the total number of reviews for each park (19,406 for California, 13,630 for Paris, and 9,620 for Hong Kong) when interpreting these counts: per review, that is about 2.82 wait words for California, 2.87 for Paris, and 1.85 for Hong Kong.

The raw counts might suggest that Disneyland in California is busier and more crowded than the other two parks, and that more visitors complain about waiting in line and long wait times there. On a per-review basis, however, California and Paris are similar, and both are well above Hong Kong.
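Another way to account for the different number of reviews per park is to measure the share of reviews that mention at least one wait word. The sketch below uses made-up reviews, a shortened word list, and a hypothetical `id` column that identifies each review:

```r
library(dplyr)
library(tidytext)

# Shortened word list for the sketch
toy_wait_words <- c("wait", "waiting", "line", "lines", "crowded")

# Made-up reviews; `id` is a hypothetical review identifier
toy_reviews <- tibble(
  id = 1:4,
  Branch = c("A", "A", "B", "B"),
  Review_Text = c("Long wait in line", "Great fun",
                  "Too crowded", "Crowded, long lines, long wait")
)

# For each review, check whether any token is a wait word,
# then take the proportion of such reviews within each park
share <- toy_reviews %>%
  unnest_tokens(word, Review_Text) %>%
  group_by(Branch, id) %>%
  summarize(mentions_wait = any(word %in% toy_wait_words), .groups = "drop") %>%
  group_by(Branch) %>%
  summarize(share_with_wait = mean(mentions_wait))

print(share)
```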

Word Clouds

Word clouds visualize what visitors talk about when reviewing their Disneyland experience. I tokenized the words in the reviews and removed stop words to create the word clouds.

library(knitr)
library(wordcloud2)

I generated a word cloud to get a general idea of the reviews across all park locations. I also used the kable() function to create a table of the top 20 words used by visitors in their reviews for all park locations.

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  anti_join(stop_words) %>%   
  count(word, sort = TRUE) %>%
  wordcloud2()
Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  anti_join(stop_words) %>%   
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
park            44557
disney          36187
rides           34508
disneyland      32935
time            29432
day             28332
ride            17792
food            14322
kids            14216
visit           11760
people          11036
fast            10067
fun              9952
parks            9847
pass             9823
2                9785
wait             9710
experience       8871
times            8869
days             8736

In general, the reviews for all three locations mention fast passes, rides, attractions, some of the words in the wait_words vector, and some adjectives describing visitors’ experience at Disneyland.
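The word cloud and the table above are built from the same word counts, so the tokenizing and counting can be done once and the result reused. A minimal sketch on two made-up reviews (`word_counts` is a name chosen here for illustration):

```r
library(dplyr)
library(tidytext)

# Made-up reviews; not rows from the dataset
toy_reviews <- tibble(
  Review_Text = c("The rides were fun",
                  "The park was fun and the kids loved the rides")
)

# Tokenize into one word per row, drop stop words, and count what is left
word_counts <- toy_reviews %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(word_counts)

# word_counts could then be passed to wordcloud2(), and head(word_counts, 20) to kable()
```

Naming the join column with `by = "word"` also avoids the message that anti_join() prints when it has to guess the column.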

I want to look deeper at what the reviews reveal about visitors’ experience and sentiment at Disneyland, especially for each park. I used the AFINN lexicon, which assigns each word a sentiment value from -5 to 5. Values above 0 are positive and values below 0 are negative: the larger the value, the more positive the word, and the closer the value is to -5, the more negative the word.

Using the filter() function to keep only words with a value greater than 0, I created a word cloud showing the positive words in the reviews for Disneyland in California. I also used the kable() function to make a table of the top 20 positive words in those reviews.
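To illustrate how the sign of the sentiment value separates positive from negative words, the sketch below joins made-up reviews to a tiny stand-in lexicon. The lexicon has the same column names as the AFINN table (word, value), but its entries and values are invented for the example and are not taken from AFINN:

```r
library(dplyr)
library(tidytext)

# Stand-in lexicon with illustrative values (not the real AFINN scores)
toy_lexicon <- tibble(
  word  = c("fun", "amazing", "bad", "disappointed"),
  value = c(2, 2, -2, -2)
)

# Made-up reviews; not rows from the dataset
toy_reviews <- tibble(
  Review_Text = c("Amazing park, so much fun",
                  "Bad queues, we left disappointed",
                  "Fun day")
)

# Keep only words found in the lexicon, label them by the sign of their value, and count
polarity_counts <- toy_reviews %>%
  unnest_tokens(word, Review_Text) %>%
  inner_join(toy_lexicon, by = "word") %>%
  mutate(polarity = if_else(value > 0, "positive", "negative")) %>%
  count(polarity, word, sort = TRUE)

print(polarity_counts)
```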

Positive Words for CA

CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
fun              5047
love             3487
adventure        3167
loved            2788
worth            2743
attractions      2446
amazing          2387
enjoy            2138
recommend        1726
nice             1543
wonderful        1531
clean            1474
friendly         1399
happy            1235
awesome          1172
favorite         1153
pretty           1027
fantastic         930
popular           923
helpful           909

I applied the same method to Disneyland Paris and Hong Kong Disneyland, making word clouds of the positive words in their reviews and tables of the top 20 positive words for each park.
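Because the same pipeline is repeated for every park and for both positive and negative words, it could be wrapped in a small helper function. The sketch below is one possible way to do this. The function `top_sentiment_words()`, its arguments, and the toy lexicon are hypothetical, and the lexicon values are made up rather than taken from AFINN:

```r
library(dplyr)
library(tidytext)

# Stand-in lexicon with the same columns as the AFINN table (word, value); values are made up
toy_lexicon <- tibble(
  word  = c("fun", "love", "bad", "waste"),
  value = c(3, 3, -3, -1)
)

# Hypothetical helper: most frequent words whose sentiment value has the given sign
# (direction = 1 keeps positive words, direction = -1 keeps negative words)
top_sentiment_words <- function(data, lexicon, direction = 1, top_n_words = 20) {
  data %>%
    unnest_tokens(word, Review_Text) %>%
    anti_join(stop_words, by = "word") %>%
    inner_join(lexicon, by = "word") %>%
    filter(value * direction > 0) %>%
    count(word, sort = TRUE) %>%
    head(top_n_words)
}

# Made-up reviews; not rows from the dataset
toy_reviews <- tibble(
  Review_Text = c("Fun rides, fun parade, love it",
                  "Bad food and a waste of money")
)

positive_words <- top_sentiment_words(toy_reviews, toy_lexicon, direction = 1)
negative_words <- top_sentiment_words(toy_reviews, toy_lexicon, direction = -1)

print(positive_words)
print(negative_words)
```

With the real data, a call such as `top_sentiment_words(HK, get_sentiments('afinn'))` would play the role of one of the repeated pipelines.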

Positive Words for HK

HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
fun              2375
attractions      1640
worth            1394
enjoy            1254
nice             1131
loved            1044
love              932
easy              890
amazing           856
recommend         805
attraction        714
clean             619
friendly          579
awesome           498
wonderful         492
happy             476
pretty            475
fantastic         437
huge              374
beautiful         352

Positive Words for Paris

Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
worth            2853
amazing          2736
attractions      2728
loved            2610
fun              2530
recommend        1851
nice             1660
fantastic        1597
love             1548
clean            1453
enjoy            1415
friendly         1203
lovely           1065
free             1025
helpful           947
wonderful         895
brilliant         894
pretty            890
huge              886
happy             883

To visualize the negative words for Disneyland in California, I used the filter() function to keep only words with a value smaller than 0 and created a word cloud showing the negative words in the reviews for the park. I used the kable() function to make a table of the top 20 negative words for Disneyland in California.

Negative Words for CA

CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
haunted          1174
bad              1112
disappointed     1025
pay               800
avoid             751
leave             742
miss              690
hard              649
crazy             471
missed            463
tired             456
stop              419
broke             418
forget            414
wrong             408
disappointing     390
lost              358
ridiculous        301
disappointment    295
waste             278

I used the same method for the other two parks, creating word clouds of the negative words in their reviews and tables of the top 20 negative words for each park.

Negative Words for HK

HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
miss              686
avoid             461
bad               458
disappointed      412
missed            351
limited           268
stop              229
forget            222
leave             211
disappointing     197
pay               180
tired             176
fire              173
haunted           172
hard              164
scary             156
waste             148
disappointment    144
drop              140
cut               124

Negative Words for Paris

Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()
Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()
word                n
bad              1245
disappointed     1203
pay               896
poor              764
miss              733
avoid             694
leave             647
disappointing     601
lack              596
tired             565
hard              554
stop              530
missed            500
waste             434
broke             432
terror            421
disappointment    403
limited           398
wrong             387
shame             374

There is not much difference in the positive or negative words visitors use to describe their experience at each Disneyland: both lists are very similar across the three parks. Future research should break down the dataset in another way to investigate the differences between parks in visitors’ reviews.
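As one possible direction for such a breakdown (a suggestion, not something done in this project), tf-idf weighting down-weights words that appear in the reviews of every park and highlights words that are distinctive to one park. The sketch below uses bind_tf_idf() from tidytext on made-up text, treating each park as one document:

```r
library(dplyr)
library(tidytext)

# Made-up review text, one row per park; not rows from the dataset
toy_reviews <- tibble(
  Branch = c("California", "Paris", "HongKong"),
  Review_Text = c("fun rides and fireworks",
                  "fun rides and castle",
                  "fun rides and parade")
)

# Words shared by all parks get a tf-idf of 0, so the top word per park is its distinctive one
distinctive <- toy_reviews %>%
  unnest_tokens(word, Review_Text) %>%
  count(Branch, word) %>%
  bind_tf_idf(word, Branch, n) %>%
  group_by(Branch) %>%
  slice_max(tf_idf, n = 1, with_ties = FALSE) %>%
  ungroup()

print(distinctive)
```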