My project uses the Disneyland Reviews dataset from Kaggle, which contains more than 42,000 TripAdvisor reviews of three Disneyland parks: California, Paris, and Hong Kong.
library(readr)
Disney <- read_csv("~/Desktop/MEA/DisneylandReviews.csv")
View(Disney)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tidytext)
It is more meaningful to look at this dataset separately for each of the three park locations. Therefore, I split the data by branch and created three new datasets: CA (Disneyland in Anaheim, CA), Paris (Disneyland Paris), and HK (Hong Kong Disneyland). I also ran some basic exploratory analysis to get a general idea of the dataset, including the number of reviews and the average rating for each park.
CA <- filter(Disney,Branch == "Disneyland_California")
View(CA)
mean(CA$Rating)
## [1] 4.405339
count(CA)
## # A tibble: 1 × 1
## n
## <int>
## 1 19406
There are 19,406 reviews of Disneyland in Anaheim, CA, with an average rating of about 4.41.
Paris <- filter(Disney,Branch == "Disneyland_Paris")
View(Paris)
mean(Paris$Rating)
## [1] 3.960088
count(Paris)
## # A tibble: 1 × 1
## n
## <int>
## 1 13630
Disneyland Paris has 13,630 reviews, with an average rating of about 3.96.
HK <- filter(Disney,Branch == "Disneyland_HongKong")
View(HK)
mean(HK$Rating)
## [1] 4.204158
count(HK)
## # A tibble: 1 × 1
## n
## <int>
## 1 9620
There are 9,620 reviews of Hong Kong Disneyland, with an average rating of about 4.20.
The average rating for each park (out of 5): 4.41 for Anaheim, CA, 3.96 for Paris, and 4.20 for Hong Kong.
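The three filter-and-summarize steps above can also be done in one grouped pass. The following is a minimal sketch on a toy data frame; the ratings are made up for illustration, and only the Branch labels and column names follow the dataset.

```r
library(dplyr)

# Toy stand-in for the Disney data frame (made-up ratings, real Branch labels)
reviews <- tibble(
  Branch = c("Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_Paris", "Disneyland_HongKong"),
  Rating = c(5, 4, 3, 4, 5)
)

# One pass: number of reviews and mean rating per park
branch_summary <- reviews %>%
  group_by(Branch) %>%
  summarize(n_reviews = n(), avg_rating = mean(Rating), .groups = "drop")

print(branch_summary)
```

Replacing `reviews` with `Disney` should give one row per park with the same counts and averages reported above.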
Disney %>%
separate(Year_Month, c("Year","Month"),sep = '-', remove=TRUE) -> Disney2
View(Disney2)
I used the separate() function to split the variable Year_Month into two new variables, Year and Month.
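To see what separate() does with entries that do not follow the year-month pattern, here is a small sketch on toy values. The "missing" label is a hypothetical placeholder for an undated review, not a confirmed value from the dataset.

```r
library(tidyr)

# Toy Year_Month values; "missing" is a hypothetical placeholder for an undated review
dates <- data.frame(Year_Month = c("2019-4", "2018-12", "missing"),
                    stringsAsFactors = FALSE)

# fill = "right" pads rows that have fewer than two pieces with NA on the right
dates_split <- separate(dates, Year_Month, c("Year", "Month"),
                        sep = "-", fill = "right")

print(dates_split)
```

Under this assumption, the placeholder text ends up in Year and only Month becomes NA, so Year still has to be checked before it is treated as a number.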
Disney2 %>%
group_by(Year, Branch) %>%
summarize(average = mean(Rating)) -> avgrating
View(avgrating)
I used the summarize() function to find the average rating by year for each of the three park locations, and then plotted a graph showing how each park's average rating changed over time.
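# Convert Year to numeric first: any entry that is not a number becomes NA
avgrating$Year <- suppressWarnings(as.numeric(avgrating$Year))
# Then drop the rows without a usable year
avgrating <- na.omit(avgrating)
View(avgrating)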
Many reviews in this dataset do not include the year and month of the visit. I used the na.omit() function to exclude reviews with a missing year and month, and plotted the average rating for each park from 2010 to 2019. Because these reviews are left out, the graph does not reflect every review in the dataset.
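The sketch below shows one way to make the exclusion explicit: convert Year to a number, drop the rows where that fails, and keep the number of reviews behind each average. The data are a toy example, and the "missing" label and the `n_reviews` column are assumptions added for illustration.

```r
library(dplyr)

# Toy data: Year is stored as text, and one row has no usable year
reviews <- tibble(
  Year   = c("2018", "2018", "2019", "missing"),
  Branch = c("Disneyland_Paris", "Disneyland_Paris",
             "Disneyland_Paris", "Disneyland_Paris"),
  Rating = c(4, 2, 5, 1)
)

yearly <- reviews %>%
  # Non-numeric years become NA; suppressWarnings() hides the coercion warning
  mutate(Year = suppressWarnings(as.numeric(Year))) %>%
  filter(!is.na(Year)) %>%
  group_by(Year, Branch) %>%
  summarize(average = mean(Rating), n_reviews = n(), .groups = "drop")

print(yearly)
```

Keeping `n_reviews` next to `average` makes it easier to tell whether a year's average rests on many reviews or only a few.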
# Shorten the branch labels for the legend
avgrating %>%
  mutate(Branch = recode(Branch,
                         'Disneyland_California' = 'California',
                         'Disneyland_HongKong' = 'Hong Kong',
                         'Disneyland_Paris' = 'Paris')) -> avgrating
ggplot(avgrating, aes(x = Year, y = average, color = Branch)) +
  geom_line() +
  labs(title = "Average Rating Over Time for Each Park",
       x = "Year",
       y = "Average Rating",
       color = "Location") +
  scale_x_continuous(breaks = seq(2010, 2019))
The graph shows a clear difference between the parks in average rating over time. The average rating for Disneyland in California remained the highest of the three parks, but it decreased from 2010 to 2018. This decline might suggest that a growing share of visitors were dissatisfied with their experience at Disneyland in California during that period. The average ratings for Disneyland Hong Kong and Disneyland Paris rose from 2010 to 2017, but both dropped starting in 2017. Further investigation is required to determine the cause of the drop in the average rating for each park.
To investigate where each park’s reviewers come from, I plotted a bar chart of the top 10 reviewer locations for each park. The mix of reviewer locations differs with the location of the park.
CA %>%
count(Reviewer_Location) %>%
arrange(desc(n)) %>%
head(10) ->top10CA
View(top10CA)
library(ggthemes)
ggplot(top10CA, aes(x = n, y = reorder(Reviewer_Location, n), fill = Reviewer_Location)) +
  geom_col() +
  labs(title = "Top 10 Reviewer Countries of Origin (California)",
       x = "Count",
       y = "Reviewer Location",
       fill = "Location") +
  scale_x_continuous(breaks = seq(0, 12000, 2000)) +
  scale_fill_hc()
Most reviewers of Disneyland in California are from the United States.
Paris %>%
count(Reviewer_Location) %>%
arrange(desc(n)) %>%
head(10) ->top10Paris
View(top10Paris)
ggplot(top10Paris, aes(x = n, y = reorder(Reviewer_Location, n), fill = Reviewer_Location)) +
  geom_col() +
  labs(title = "Top 10 Reviewer Countries of Origin (Paris)",
       x = "Count",
       y = "Reviewer Location",
       fill = "Location") +
  scale_x_continuous(breaks = seq(0, 12000, 2000)) +
  scale_fill_hc()
The largest group of Disneyland Paris reviewers is from the United Kingdom, and six of the top ten countries are in Europe.
HK %>%
count(Reviewer_Location) %>%
arrange(desc(n)) %>%
head(10) ->top10HK
View(top10HK)
ggplot(top10HK, aes(x = n, y = reorder(Reviewer_Location, n), fill = Reviewer_Location)) +
  geom_col() +
  labs(title = "Top 10 Reviewer Countries of Origin (Hong Kong)",
       x = "Count",
       y = "Reviewer Location",
       fill = "Location") +
  scale_x_continuous(breaks = seq(0, 12000, 2000)) +
  scale_fill_hc()
Many reviewers of Hong Kong Disneyland are from countries in the same region, such as Australia, India, and the Philippines.
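The three top-10 tables above repeat the same steps, so they could be produced by one small function. This is a sketch only: `top_locations`, its arguments `park` and `k`, and the toy rows are hypothetical names and data introduced for illustration.

```r
library(dplyr)

# Hypothetical helper: the k most common reviewer locations for one park
top_locations <- function(data, park, k = 10) {
  data %>%
    filter(Branch == park) %>%
    count(Reviewer_Location, sort = TRUE, name = "reviews") %>%
    head(k)
}

# Toy rows with the same column names as the dataset
reviews <- tibble(
  Branch = c(rep("Disneyland_HongKong", 4), "Disneyland_Paris"),
  Reviewer_Location = c("Australia", "India", "Australia",
                        "Philippines", "United Kingdom")
)

top_hk <- top_locations(reviews, "Disneyland_HongKong", k = 2)
print(top_hk)
```

With the full data, a call such as `top_locations(Disney, "Disneyland_Paris")` would take the place of the separate filter, count, arrange, and head steps.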
family_words <- c("family", "kid", "kids", "children", "child", "mother", "father", "parent", "parents", "son", "sons", "daughter", "daughters", "mom", "dad", "brother", "brothers", "sister", "sisters", "sibling", "siblings", "cousin", "cousins", "wife", "husband", "grandmother", "grandfather", "grandparent", "grandparents", "grandson", "grandsons", "granddaughter", "granddaughters", "grandma", "grandpa", "girl", "girls", "boy", "boys", "age", "aged")
Disneyland is known as a family vacation destination, and it is high on the bucket list for many families. I’m interested in how often reviewers mention visiting Disneyland with family members and describe their experience as a family. I built a character vector of family-related words, called family_words, and used it to count how many family words appear in visitors’ reviews for each park.
Disney %>%
unnest_tokens(word, Review_Text) %>%
filter(word %in% family_words) %>%
count(Branch, sort = TRUE)
## # A tibble: 3 × 2
## Branch n
## <chr> <int>
## 1 Disneyland_Paris 19805
## 2 Disneyland_California 19492
## 3 Disneyland_HongKong 10038
Family words were used 19,805 times in reviews of Disneyland Paris, 19,492 times in reviews of Disneyland in California, and 10,038 times in reviews of Disneyland Hong Kong. Relative to the number of reviews for each park (13,630 for Paris, 19,406 for California, and 9,620 for Hong Kong), that is about 1.45 family words per review for Paris, compared with about 1.00 for California and 1.04 for Hong Kong. Family words therefore appear more often in the Disneyland Paris reviews than in reviews of the other two parks.
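The comparison against review counts can be made explicit with a short calculation. The sketch below only reuses the counts and review totals reported above; the vector names are introduced here for illustration.

```r
# Counts copied from the output and review totals reported above
family_counts <- c(Paris = 19805, California = 19492, HongKong = 10038)
review_totals <- c(Paris = 13630, California = 19406, HongKong = 9620)

# Family-word mentions per review, rounded to two decimals
family_per_review <- round(family_counts / review_totals, 2)

print(family_per_review)
```

This is a rate per review, not per word, so longer reviews would still push the figure up.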
wait_words <- c('hour','hours','minutes','long', 'day','days','wait','waiting','line','lines','pass','busy','crowded','stand')
Reviews often include comments or complaints about the wait time for attractions and rides at Disneyland. Therefore, I built a character vector of words related to waiting in line, called wait_words, and counted how many of these words appear in visitors’ reviews for each park.
Disney %>%
unnest_tokens(word, Review_Text) %>%
filter(word %in% wait_words) %>%
count(Branch, sort = TRUE)
## # A tibble: 3 × 2
## Branch n
## <chr> <int>
## 1 Disneyland_California 54722
## 2 Disneyland_Paris 39159
## 3 Disneyland_HongKong 17763
Visitors used wait words most often in reviews of Disneyland in California (54,722 times), compared with 39,159 times for Disneyland Paris and 17,763 times for Disneyland Hong Kong. These raw counts have to be read against the number of reviews for each park (19,406 for California, 13,630 for Paris, and 9,620 for Hong Kong): that is about 2.82 wait words per review for California, 2.87 for Paris, and 1.85 for Hong Kong.
The raw counts might suggest that Disneyland in California is busier and more crowded than the other two parks. Per review, however, California and Paris are nearly level and both are well above Hong Kong, so mentions of waiting in line and long wait times appear to be about as common for Disneyland Paris as for Disneyland in California.
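Because token counts grow with both the number of reviews and their length, another option is to ask what share of reviews mention waiting at all. The sketch below uses base R on made-up review text, with a shortened word list and a simple split on non-letters standing in for unnest_tokens(); the `mentions_wait` column is a name introduced here.

```r
# A few of the wait words from the list above
wait_words <- c("wait", "waiting", "line", "lines", "crowded")

# Made-up review text for illustration
reviews <- data.frame(
  Branch = c("Disneyland_California", "Disneyland_California", "Disneyland_Paris"),
  Review_Text = c("Long wait, long lines, but fun",
                  "Clean park and friendly staff",
                  "Too crowded and the line was slow"),
  stringsAsFactors = FALSE
)

# Lower-case each review and split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(reviews$Review_Text), "[^a-z']+")

# TRUE when a review contains at least one wait word
reviews$mentions_wait <- vapply(tokens, function(w) any(w %in% wait_words), logical(1))

# Share of reviews per park that mention waiting at least once
share_by_park <- tapply(reviews$mentions_wait, reviews$Branch, mean)

print(share_by_park)
```

A review that repeats "wait" five times counts once here, which limits the influence of a few very long reviews.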
Word clouds visualize what visitors talk about when reviewing their Disneyland experience. I tokenized the review text and removed stop words before creating the word clouds.
library(knitr)
library(wordcloud2)
I generated a word cloud to get a general idea of the reviews across all park locations. I also used the kable() function to create a table of the top 20 words used by visitors in their reviews for all park locations.
Disney %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
wordcloud2()
Disney %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| park | 44557 |
| disney | 36187 |
| rides | 34508 |
| disneyland | 32935 |
| time | 29432 |
| day | 28332 |
| ride | 17792 |
| food | 14322 |
| kids | 14216 |
| visit | 11760 |
| people | 11036 |
| fast | 10067 |
| fun | 9952 |
| parks | 9847 |
| pass | 9823 |
| 2 | 9785 |
| wait | 9710 |
| experience | 8871 |
| times | 8869 |
| days | 8736 |
In general, the reviews for all three locations mention rides, fast passes, food, and kids, along with some of the words in the wait_words vector (such as day, wait, and pass) and words describing the experience, such as fun.
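To make the tokenize, remove stop words, and count steps easier to follow, here is the same pipeline on two made-up reviews. Passing `by = "word"` to anti_join() is an addition here that names the join column explicitly.

```r
library(dplyr)
library(tidytext)

# Two made-up reviews
reviews <- tibble(
  Review_Text = c("The rides were fun and the park was clean",
                  "We had to wait for the rides")
)

top_words <- reviews %>%
  unnest_tokens(word, Review_Text) %>%    # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%  # drop common words such as "the"
  count(word, sort = TRUE)                # most frequent words first

print(top_words)
```

Words like "the" and "were" drop out at the anti_join() step, which is why the top-20 table above contains mostly content words.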
I want to look more closely at what the reviews reveal about visitors’ experience and sentiment at Disneyland, especially for each park. I used the AFINN lexicon, which assigns each word a sentiment value from -5 to 5. Values above 0 are positive and values below 0 are negative: the larger the value, the more positive the word, and the closer the value is to -5, the more negative the word.
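The scoring step can be illustrated without downloading the lexicon. In the sketch below, `toy_lexicon` is a hypothetical mini-lexicon with the same two columns as get_sentiments('afinn') (word and value); its scores and the sample words are made up for illustration.

```r
library(dplyr)

# Hypothetical mini-lexicon: one row per word with a made-up sentiment value
toy_lexicon <- tibble(word  = c("fun", "amazing", "bad", "disappointed"),
                      value = c(3, 4, -3, -2))

# Words as they might look after tokenizing reviews of two parks
words <- tibble(
  Branch = c("Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_Paris", "Disneyland_Paris"),
  word = c("fun", "bad", "amazing", "disappointed", "rides")
)

# Words without a score (here "rides") drop out of the inner join
scored <- inner_join(words, toy_lexicon, by = "word")

positive <- filter(scored, value > 0)
negative <- filter(scored, value < 0)

# Mean sentiment value of the scored words, per park
mean_sentiment <- scored %>%
  group_by(Branch) %>%
  summarize(mean_value = mean(value), .groups = "drop")

print(mean_sentiment)
```

A per-park mean like `mean_value` uses the size of each score as well as its sign, whereas the word clouds only separate positive from negative words.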
Using the filter() function to keep only words with a value greater than 0, I created a word cloud of the positive words in the reviews for Disneyland in California. I also used the kable() function to make a table of the top 20 positive words in those reviews.
CA %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
wordcloud2()
CA %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| fun | 5047 |
| love | 3487 |
| adventure | 3167 |
| loved | 2788 |
| worth | 2743 |
| attractions | 2446 |
| amazing | 2387 |
| enjoy | 2138 |
| recommend | 1726 |
| nice | 1543 |
| wonderful | 1531 |
| clean | 1474 |
| friendly | 1399 |
| happy | 1235 |
| awesome | 1172 |
| favorite | 1153 |
| pretty | 1027 |
| fantastic | 930 |
| popular | 923 |
| helpful | 909 |
I applied the same method to Disneyland Hong Kong and Disneyland Paris, making a word cloud of the positive words in each park's reviews and a table of the top 20 positive words for each park.
HK %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
wordcloud2()
HK %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| fun | 2375 |
| attractions | 1640 |
| worth | 1394 |
| enjoy | 1254 |
| nice | 1131 |
| loved | 1044 |
| love | 932 |
| easy | 890 |
| amazing | 856 |
| recommend | 805 |
| attraction | 714 |
| clean | 619 |
| friendly | 579 |
| awesome | 498 |
| wonderful | 492 |
| happy | 476 |
| pretty | 475 |
| fantastic | 437 |
| huge | 374 |
| beautiful | 352 |
Paris %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
wordcloud2()
Paris %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| worth | 2853 |
| amazing | 2736 |
| attractions | 2728 |
| loved | 2610 |
| fun | 2530 |
| recommend | 1851 |
| nice | 1660 |
| fantastic | 1597 |
| love | 1548 |
| clean | 1453 |
| enjoy | 1415 |
| friendly | 1203 |
| lovely | 1065 |
| free | 1025 |
| helpful | 947 |
| wonderful | 895 |
| brilliant | 894 |
| pretty | 890 |
| huge | 886 |
| happy | 883 |
To visualize the negative words for Disneyland in California, I used the filter() function to keep only words with a value smaller than 0 and created a word cloud of the negative words in the park's reviews. I used the kable() function to make a table of the top 20 negative words for Disneyland in California.
CA %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
wordcloud2()
CA %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| haunted | 1174 |
| bad | 1112 |
| disappointed | 1025 |
| pay | 800 |
| avoid | 751 |
| leave | 742 |
| miss | 690 |
| hard | 649 |
| crazy | 471 |
| missed | 463 |
| tired | 456 |
| stop | 419 |
| broke | 418 |
| forget | 414 |
| wrong | 408 |
| disappointing | 390 |
| lost | 358 |
| ridiculous | 301 |
| disappointment | 295 |
| waste | 278 |
I used the same method for the other two parks, creating a word cloud of the negative words in each park's reviews and a table of the top 20 negative words for each park.
HK %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
wordcloud2()
HK %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| miss | 686 |
| avoid | 461 |
| bad | 458 |
| disappointed | 412 |
| missed | 351 |
| limited | 268 |
| stop | 229 |
| forget | 222 |
| leave | 211 |
| disappointing | 197 |
| pay | 180 |
| tired | 176 |
| fire | 173 |
| haunted | 172 |
| hard | 164 |
| scary | 156 |
| waste | 148 |
| disappointment | 144 |
| drop | 140 |
| cut | 124 |
Paris %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
wordcloud2()
Paris %>%
unnest_tokens(word, Review_Text) %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value < 0) %>%
count(word, sort = TRUE) %>%
head(20) %>%
kable()
| word | n |
|---|---|
| bad | 1245 |
| disappointed | 1203 |
| pay | 896 |
| poor | 764 |
| miss | 733 |
| avoid | 694 |
| leave | 647 |
| disappointing | 601 |
| lack | 596 |
| tired | 565 |
| hard | 554 |
| stop | 530 |
| missed | 500 |
| waste | 434 |
| broke | 432 |
| terror | 421 |
| disappointment | 403 |
| limited | 398 |
| wrong | 387 |
| shame | 374 |
There is little difference in the positive or negative words visitors use to describe their experience at each Disneyland: the top words for the three parks are very similar. Future research should break the dataset down in another way to investigate how visitors’ reviews differ between parks.
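One simple alternative breakdown is to compare rates and not ranked lists. The sketch below takes two words from the negative-word tables above and divides their counts by each park's number of reviews; the choice of words and the per-1,000-reviews scale are assumptions made for illustration.

```r
# Review totals reported earlier
review_totals <- c(California = 19406, Paris = 13630, HongKong = 9620)

# Counts copied from the negative-word tables above
negative_counts <- rbind(
  bad          = c(California = 1112, Paris = 1245, HongKong = 458),
  disappointed = c(California = 1025, Paris = 1203, HongKong = 412)
)

# Divide each column by that park's review total, then scale to per 1,000 reviews
per_thousand <- round(sweep(negative_counts, 2, review_totals, "/") * 1000, 1)

print(per_thousand)
```

For these two words the Paris rates come out highest, which suggests that normalized rates may show differences between parks that the similar-looking top-20 lists hide.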