Disneyland Reviews Analysis

My project used a dataset called Disneyland Reviews, found on Kaggle, containing about 42,000 TripAdvisor reviews of three Disneyland parks: California, Paris, and Hong Kong.

library(readr)
Disney <- read_csv("~/Desktop/MEA/DisneylandReviews.csv")
View(Disney)

library(dplyr)
library(ggplot2)
library(tidyverse)
library(tidytext)

It is meaningful to look at this dataset by park location. Therefore, I split the data by branch location and created three new datasets called CA (representing Disneyland in Anaheim, CA), Paris (representing Disneyland Paris), and HK (representing Hong Kong Disneyland). I also performed some basic data analysis to get a general idea of the dataset, including the number of reviews and the average rating for each park.

California

CA <- filter(Disney,Branch == "Disneyland_California")
View(CA)
mean(CA$Rating)

## [1] 4.405339

count(CA)

## # A tibble: 1 × 1
##       n
##   <int>
## 1 19406

There are 19,406 reviews about Disneyland in Anaheim, CA, with an average rating of about 4.41.

Paris

Paris <- filter(Disney,Branch == "Disneyland_Paris")
View(Paris)
mean(Paris$Rating)

## [1] 3.960088

count(Paris)

## # A tibble: 1 × 1
##       n
##   <int>
## 1 13630

There are 13,630 reviews about Disneyland Paris, with an average rating of about 3.96.

Hong Kong

HK <- filter(Disney,Branch == "Disneyland_HongKong")
View(HK)
mean(HK$Rating)

## [1] 4.204158

count(HK)

## # A tibble: 1 × 1
##       n
##   <int>
## 1  9620

There are 9,620 reviews about Hong Kong Disneyland, with an average rating of about 4.20.

In Summary

The average rating for each park (out of 5): 4.41 for Anaheim, CA, 3.96 for Paris, and 4.20 for Hong Kong.
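The per-park counts and averages above can also be computed in a single pass instead of filtering each branch separately. The following is a minimal sketch, assuming a data frame with the same Branch and Rating columns; the toy values and the name park_summary are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the Disney data frame (values are invented for illustration)
toy <- tibble(
  Branch = c("Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_HongKong"),
  Rating = c(5, 4, 3, 4)
)

# One row per park: number of reviews and mean rating
park_summary <- toy %>%
  group_by(Branch) %>%
  summarise(reviews = n(), average = mean(Rating), .groups = "drop")

print(park_summary)
```

Applied to the real data, the same pipeline would replace the three separate mean() and count() calls.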

Disney %>%
  separate(Year_Month, c("Year","Month"),sep = '-', remove=TRUE) -> Disney2
View(Disney2)

I used the separate() function in R to separate the variable Year_Month into two new variables Year and Month.
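As a small illustration of what separate() does here, the sketch below splits a few made-up Year_Month values. The toy values are assumptions about the column's format; the point is that both new columns come back as character, which is why Year is converted with as.numeric() before plotting.

```r
library(dplyr)
library(tidyr)

# Toy values in an assumed "year-month" format
toy <- tibble(Year_Month = c("2019-4", "2018-12", "2015-7"))

toy_split <- toy %>%
  separate(Year_Month, c("Year", "Month"), sep = "-", remove = TRUE)

# Both new columns are character vectors, not numbers
str(toy_split)
```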

Disney2 %>%
  group_by(Year, Branch) %>%
  summarize(average = mean(Rating)) -> avgrating
View(avgrating)

I used the summarize() function in R to find the average rating by year for the three park locations, and then plotted a graph showing how the average rating for each park changed over time.

na.omit(avgrating) -> avgrating
View(avgrating)
as.numeric(avgrating$Year) -> avgrating$Year

Many reviews in this dataset do not include any information about the year and month. I used the na.omit() function to exclude entries with missing years and months and plotted a graph showing the average rating for each park from 2010 to 2019. Because the reviews with missing dates are left out, the graph does not reflect every review in the dataset.
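One detail worth checking is how the missing dates are stored. If they appear as a placeholder string rather than as NA (an assumption; inspect the raw Year_Month column to confirm), na.omit() will not remove them until the year has been converted to a number. A minimal sketch with made-up values:

```r
library(dplyr)

# Toy yearly averages; "missing" mimics an assumed placeholder string
toy_avg <- tibble(
  Year = c("2018", "2019", "missing"),
  Branch = "Disneyland_Paris",
  average = c(4.0, 3.9, 3.8)
)

# na.omit() alone keeps all three rows, because "missing" is a string, not NA
rows_after_na_omit <- nrow(na.omit(toy_avg))

# Converting first turns the placeholder into NA, which can then be dropped
toy_clean <- toy_avg %>%
  mutate(Year = suppressWarnings(as.numeric(Year))) %>%
  filter(!is.na(Year))
```

Converting before filtering keeps the dropped rows explicit, rather than relying on the plotting step to discard them.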

avgrating %>%
  mutate(Branch = recode(Branch, 'Disneyland_California'='California','Disneyland_HongKong' = 'Hong Kong', 'Disneyland_Paris' = "Paris")) -> avgrating
ggplot(avgrating,aes(x=Year, y=average, color = Branch)) + 
  geom_line() +
  labs(title="Average Rating Over Time for Each Park",
       x="Year",
       y="Average Rating",
       color="Location") +
  scale_x_continuous(breaks = seq(2010, 2019))

The graph shows a clear difference in average rating over time among the three parks. Disneyland in California kept the highest average rating of all parks, but its rating decreased from 2010 to 2018. This decline might reflect that more visitors were dissatisfied with their experience at Disneyland in California during that period. The average ratings for Disneyland Hong Kong and Disneyland Paris rose between 2010 and 2017, but both dropped starting in 2017. Further investigation is required to determine the cause of the drop in the average rating for each park.

Visitors’ Country of Origin

To investigate visitors’ country of origin for each Disneyland, I plotted a column graph of the top 10 reviewer locations for each park. The mix of nationalities depends on where the park is located.

California

CA %>%
  count(Reviewer_Location) %>%
  arrange(desc(n)) %>%
  head(10) ->top10CA
View(top10CA)

library(ggthemes)
ggplot(top10CA,aes(x=n,y=reorder(Reviewer_Location,n), fill=Reviewer_Location)) + geom_col() +
  labs(title="Top 10 Countries of Origin for Reviewers (California)",
       x="Count",
       y="Reviewer Location",
       fill="Location") +
  scale_x_continuous(breaks = seq(0, 12000,2000))+
  scale_fill_hc()

Most reviewers of Disneyland in California are from the United States.

Paris

Paris %>%
  count(Reviewer_Location) %>%
  arrange(desc(n)) %>%
  head(10) ->top10Paris
View(top10Paris)

ggplot(top10Paris,aes(x=n,y=reorder(Reviewer_Location,n), fill=Reviewer_Location)) + geom_col() +
  labs(title="Top 10 Countries of Origin for Reviewers (Paris)",
       x="Count",
       y="Reviewer Location",
       fill="Location") +
  scale_x_continuous(breaks = seq(0, 12000,2000))+
  scale_fill_hc()

Most of Paris’s visitors are from the United Kingdom, and six out of the top ten countries are in Europe.

Hong Kong

HK %>%
  count(Reviewer_Location) %>%
  arrange(desc(n)) %>%
  head(10) ->top10HK
View(top10HK)

ggplot(top10HK,aes(x=n,y=reorder(Reviewer_Location,n), fill=Reviewer_Location)) + geom_col() +
  labs(title="Top 10 Countries of Origin for Reviewers (Hong Kong)",
       x="Count",
       y="Reviewer Location",
       fill="Location") +
  scale_x_continuous(breaks = seq(0, 12000,2000))+
  scale_fill_hc()

People who visit Hong Kong Disneyland are from nearby countries, such as Australia, India, and the Philippines.
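The top-10 step is the same for all three parks, so it could be wrapped in a small function. This is only a sketch: top_locations is a hypothetical helper name, and the toy data stands in for CA, Paris, or HK.

```r
library(dplyr)

# Hypothetical helper: the n most frequent reviewer locations in one park's data
top_locations <- function(reviews, n = 10) {
  reviews %>%
    count(Reviewer_Location, sort = TRUE) %>%
    head(n)
}

# Toy stand-in for one park's reviews
toy_park <- tibble(
  Reviewer_Location = c("United States", "United States", "Canada",
                        "Australia", "Canada", "United States")
)

top2 <- top_locations(toy_park, n = 2)
print(top2)
```

With the real data, top_locations(CA), top_locations(Paris), and top_locations(HK) would produce the three tables that feed the plots.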

Family Words

family_words <- c("family", "kid","kids", "children","child","mother", "father", "parent", "parents", "son", "sons","daughter","daughters", "mom", "dad","brother", "brothers","sister","sisters","siblings","siblings","cousin","cousins","wife","husband","grandmother","graondfather","grandparet","grandparents","grandson","grandsons","granddaughter", "granddaughters","grandma","grandpa","girl","grils","boy","boys","age","aged")

Disneyland is known as a family vacation destination, and it is high on the bucket list for many families. I’m interested in how often reviewers mention visiting Disneyland with family members and sharing the experience as a family. I built a character vector of family-related words called family_words and used it to count the family words in visitors’ reviews for each park.

Number of Family Words for Each Park

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  filter(word %in% family_words) %>% 
  count(Branch, sort = TRUE)

## # A tibble: 3 × 2
##   Branch                    n
##   <chr>                 <int>
## 1 Disneyland_Paris      19805
## 2 Disneyland_California 19492
## 3 Disneyland_HongKong   10038

Family words appeared 19,805 times in reviews of Disneyland Paris, 19,492 times in reviews of Disneyland in California, and 10,038 times in reviews of Disneyland Hong Kong. Relative to the number of reviews for each park in this dataset (13,630 for Paris, 19,406 for California, and 9,620 for Hong Kong), that is about 1.45 family words per review for Paris, compared with about 1.00 for California and 1.04 for Hong Kong, so family words appeared more often in the Disneyland Paris reviews than in the reviews of the other two parks.
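The comparison against review counts can be made explicit by dividing each park's family-word total by its number of reviews. The sketch below uses the figures reported above; the column names family_word_count, reviews, and per_review are chosen here for illustration.

```r
library(dplyr)

# Totals taken from the output above
family_rates <- tibble(
  Branch = c("Disneyland_Paris", "Disneyland_California", "Disneyland_HongKong"),
  family_word_count = c(19805, 19492, 10038),
  reviews = c(13630, 19406, 9620)
) %>%
  mutate(per_review = round(family_word_count / reviews, 2))

print(family_rates)
```

This is a rough normalization: it does not account for differences in review length between parks.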

Wait Words

wait_words <- c('hour','hours','minutes','long', 'day','days','wait','waiting','line','lines','pass','busy','crowded','stand')

Reviews often include comments or complaints about wait times for attractions and rides at Disneyland. Therefore, I built a character vector of words related to waiting in line called wait_words. I’m interested in how many times the words in wait_words appear in visitors’ reviews for each park.

Number of Wait Words for Each Park

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  filter(word %in% wait_words) %>% 
  count(Branch, sort = TRUE)

## # A tibble: 3 × 2
##   Branch                    n
##   <chr>                 <int>
## 1 Disneyland_California 54722
## 2 Disneyland_Paris      39159
## 3 Disneyland_HongKong   17763

Visitors used wait words most often in reviews of Disneyland in California: 54,722 times, compared with 39,159 times for Disneyland Paris and 17,763 times for Disneyland Hong Kong. These totals must be read against the number of reviews for each park (19,406 for California, 13,630 for Paris, and 9,620 for Hong Kong). Per review, that is about 2.82 wait words for California, 2.87 for Paris, and 1.85 for Hong Kong.

The raw counts might suggest that Disneyland in California is busier and more crowded than the other two parks, and that more visitors complain about waiting in line and long wait times there. On a per-review basis, however, California and Paris are nearly equal, and both are well above Hong Kong.
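Another way to compare parks is to count reviews that mention at least one wait word, rather than counting every occurrence. The sketch below shows the idea on toy data; the Review_ID column, the shortened word list, and the branch labels "A" and "B" are assumptions made for this example.

```r
library(dplyr)
library(tidytext)

# Shortened word list for the example only
wait_words_demo <- c("wait", "waiting", "line", "lines", "crowded")

toy <- tibble(
  Review_ID = 1:4,
  Branch = c("A", "A", "B", "B"),
  Review_Text = c("Long wait and a long line", "Great fun",
                  "Very crowded", "Waiting in lines all day")
)

# Share of reviews per park that contain at least one wait word
wait_share <- toy %>%
  unnest_tokens(word, Review_Text) %>%
  group_by(Branch) %>%
  summarise(
    reviews = n_distinct(Review_ID),
    reviews_with_wait = n_distinct(Review_ID[word %in% wait_words_demo]),
    .groups = "drop"
  ) %>%
  mutate(share = reviews_with_wait / reviews)

print(wait_share)
```

A share like this is less sensitive to a few very long reviews that repeat the same word many times.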

Word Clouds

Word clouds visualize what visitors talk about when reviewing their Disneyland experience. To create them, I tokenized the words in the reviews and removed stopwords.
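To show what tokenizing and stopword removal do before any cloud is drawn, here is a minimal sketch on two made-up reviews. It uses the stop_words table that ships with tidytext; passing by = "word" is optional and only silences the join message.

```r
library(dplyr)
library(tidytext)

# Two invented reviews
toy <- tibble(Review_Text = c("The rides were fun", "We loved the rides"))

word_counts <- toy %>%
  unnest_tokens(word, Review_Text) %>%      # one lowercase word per row
  anti_join(stop_words, by = "word") %>%    # drop common words such as "the"
  count(word, sort = TRUE)

print(word_counts)
```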

library(knitr)
library(wordcloud2)

I generated a word cloud to get a general idea of the reviews across all park locations. I also used the kable() function in R to create a table of the top 20 words used by visitors in their reviews for all park locations.

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  anti_join(stop_words) %>%   
  count(word, sort = TRUE) %>%
  wordcloud2()

Disney %>% 
  unnest_tokens(word, Review_Text) %>% 
  anti_join(stop_words) %>%   
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
park	44557
disney	36187
rides	34508
disneyland	32935
time	29432
day	28332
ride	17792
food	14322
kids	14216
visit	11760
people	11036
fast	10067
fun	9952
parks	9847
pass	9823
2	9785
wait	9710
experience	8871
times	8869
days	8736

In general, the reviews for all three locations mention fast passes, rides, attractions, some of the words in the wait_words vector, and some adjectives describing the experience at Disneyland.

I want to look deeper at what the reviews reveal about visitors’ experience and sentiment at Disneyland, especially for each park. I used the AFINN lexicon, loaded with get_sentiments('afinn'), which assigns each word a sentiment value from -5 to 5. With 0 as the breakpoint, the larger the value, the more positive the word; the closer the value is to -5, the more negative the word.
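The scoring step can be illustrated without downloading a lexicon. The sketch below joins a handful of tokens to a hypothetical mini-lexicon with the same word and value columns; the scores are invented for illustration and are not the actual lexicon values.

```r
library(dplyr)

# Hypothetical mini-lexicon; the scores are illustrative only
toy_lexicon <- tibble(word = c("fun", "amazing", "bad", "waste"),
                      value = c(3, 4, -3, -2))

tokens <- tibble(word = c("fun", "bad", "fun", "rides", "waste", "amazing"))

# Words without a score (here "rides") drop out of the inner join
scored <- inner_join(tokens, toy_lexicon, by = "word")

positive <- scored %>% filter(value > 0) %>% count(word, sort = TRUE)
negative <- scored %>% filter(value < 0) %>% count(word, sort = TRUE)
```

Note that only words present in the lexicon are kept, so the resulting tables describe scored words rather than all words in the reviews.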

Using the filter() function to keep words with a value greater than 0, I created a word cloud showing positive words in the reviews for Disneyland in California. I also used the kable() function in R to make a table of the top 20 positive words in those reviews.

Positive Words for CA

CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
fun	5047
love	3487
adventure	3167
loved	2788
worth	2743
attractions	2446
amazing	2387
enjoy	2138
recommend	1726
nice	1543
wonderful	1531
clean	1474
friendly	1399
happy	1235
awesome	1172
favorite	1153
pretty	1027
fantastic	930
popular	923
helpful	909

I applied the same method to Disneyland Hong Kong and Disneyland Paris, making word clouds of the positive words in their reviews and creating tables of the top 20 positive words for each park.

Positive Words for HK

HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
fun	2375
attractions	1640
worth	1394
enjoy	1254
nice	1131
loved	1044
love	932
easy	890
amazing	856
recommend	805
attraction	714
clean	619
friendly	579
awesome	498
wonderful	492
happy	476
pretty	475
fantastic	437
huge	374
beautiful	352

Positive Words for Paris

Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value > 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
worth	2853
amazing	2736
attractions	2728
loved	2610
fun	2530
recommend	1851
nice	1660
fantastic	1597
love	1548
clean	1453
enjoy	1415
friendly	1203
lovely	1065
free	1025
helpful	947
wonderful	895
brilliant	894
pretty	890
huge	886
happy	883

To visualize the negative words for Disneyland in California, I used the filter() function to keep words with a value smaller than 0 and created a word cloud of the negative words in the park's reviews. I used the kable() function in R to make a table of the top 20 negative words for Disneyland in California.

Negative Words for CA

CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

CA %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
haunted	1174
bad	1112
disappointed	1025
pay	800
avoid	751
leave	742
miss	690
hard	649
crazy	471
missed	463
tired	456
stop	419
broke	418
forget	414
wrong	408
disappointing	390
lost	358
ridiculous	301
disappointment	295
waste	278

I used the same method for the other two parks, creating word clouds of the negative words in their reviews and tables of the top 20 negative words for each park.
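Because the same pipeline is repeated for each park and each polarity, it could be wrapped in one function. The following is a sketch under stated assumptions: top_sentiment_words is a hypothetical helper, lexicon is any data frame with word and value columns (for example the result of get_sentiments('afinn')), and the toy lexicon scores are invented.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper wrapping the repeated tokenize / join / filter / count steps
top_sentiment_words <- function(reviews, lexicon, positive = TRUE, n = 20) {
  reviews %>%
    unnest_tokens(word, Review_Text) %>%
    anti_join(stop_words, by = "word") %>%
    inner_join(lexicon, by = "word") %>%
    filter(if (positive) value > 0 else value < 0) %>%
    count(word, sort = TRUE) %>%
    head(n)
}

# Toy data and an illustrative lexicon (scores are made up)
toy_reviews <- tibble(Review_Text = c("So much fun, fun rides",
                                      "Bad food and I was disappointed"))
toy_lexicon <- tibble(word = c("fun", "bad", "disappointed"),
                      value = c(3, -3, -2))

toy_pos <- top_sentiment_words(toy_reviews, toy_lexicon, positive = TRUE)
toy_neg <- top_sentiment_words(toy_reviews, toy_lexicon, positive = FALSE)
```

With the real data, a call such as top_sentiment_words(HK, get_sentiments('afinn'), positive = FALSE) would stand in for one of the repeated blocks, and its result could be piped to kable() or wordcloud2().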

Negative Words for HK

HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

HK %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
miss	686
avoid	461
bad	458
disappointed	412
missed	351
limited	268
stop	229
forget	222
leave	211
disappointing	197
pay	180
tired	176
fire	173
haunted	172
hard	164
scary	156
waste	148
disappointment	144
drop	140
cut	124

Negative Words for Paris

Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  wordcloud2()

Paris %>%
  unnest_tokens(word, Review_Text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments('afinn')) %>%
  filter(value < 0) %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  kable()

word	n
bad	1245
disappointed	1203
pay	896
poor	764
miss	733
avoid	694
leave	647
disappointing	601
lack	596
tired	565
hard	554
stop	530
missed	500
waste	434
broke	432
terror	421
disappointment	403
limited	398
wrong	387
shame	374

There is not much difference in the positive or negative words visitors use to describe their experience at each Disneyland; the top words for each park are very similar. Future research should break down the dataset in another way to investigate how visitors’ reviews differ between parks.

Winnie Yang

3/28/2022