Let’s load the data. Because the full dataset file is too large, the data was screened before import; after processing, 4264 records remain.

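# Packages used by the code below
library(tidyverse)  # dplyr, tidyr, stringr, ggplot2
library(tidytext)   # unnest_tokens, stop_words, get_sentiments, reorder_within
library(igraph)     # graph_from_data_frame
library(ggraph)     # ggraph, geom_edge_link, geom_node_point, geom_node_text

dsy <- read.csv("./data/DisneylandReviews.csv")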

Executive summary

With each new Disney film, public interest in the brand keeps growing, and Disneyland parks have been established in several countries around the world. Although all three parks studied here are Disneyland parks, they seem to receive different reviews from visitors. This report asks three questions. How popular is each Disneyland park with visitors from different regions (number of visitors per park)? How do visitors from different countries feel about each park (positive or negative)? What do the reviews of each park mainly talk about (most frequent words and phrases)? To answer them, I analyze comments from visitors to Disney parks in three locations: Paris, California and Hong Kong. The results should help Disneyland’s future operations, because visitors’ comments point to improvements that would give guests a better experience.

Data background

The dataset includes about 42,000 reviews of 3 Disneyland branches (Paris, California and Hong Kong) posted by visitors on Trip Advisor. It contains the columns Review_ID (unique ID given to each review), Rating (from 1 (unsatisfied) to 5 (satisfied)), Year_Month (when the reviewer visited the theme park), Reviewer_Location (the visitor’s country of origin), Review_Text (the visitor’s comments) and Branch (location of the Disneyland park). These data can be used to analyze how visitors from different countries experience and evaluate the Disneyland parks in different regions. In total there are 42656 rows and 6 variables.

Data cleaning and preprocessing

I will select the columns needed for the analysis and rename them for the subsequent steps.

# Select the columns I want to use and rename them
dsy_new <- select(dsy,
                  Disneyland = Branch, 
                  Nationality = Reviewer_Location, 
                  Review = Review_Text)

After the selection, the dataset is left with 3 variables.

Because the dataset is relatively large, I will randomly sample 4000 reviews for analysis.

# Draw a random sample of 4000 reviews without replacement.
# No seed is set, so the sampled rows change from run to run.
dsy_new <- dsy_new %>%
  sample_n(4000, replace = FALSE)

After the selection, 4000 rows remain.
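
The sampling step can be made repeatable by fixing the random seed first. The following is a minimal sketch on a made-up table, not the report's actual code: the `toy` data frame and the seed value 2023 are assumptions chosen for illustration, and `slice_sample()` is the current dplyr equivalent of `sample_n()`.

```r
library(dplyr)

# Hypothetical stand-in for the review table (not the real file)
toy <- tibble(ID = 1:10, Review = paste("review", 1:10))

# Setting the seed before each draw makes the draw repeatable.
# The value 2023 is an arbitrary choice for this illustration.
set.seed(2023)
sample_a <- toy %>% slice_sample(n = 4)

set.seed(2023)
sample_b <- toy %>% slice_sample(n = 4)

# Same seed, same rows in the same order
identical(sample_a, sample_b)
```

With a fixed seed, the counts and figures that follow would stay the same each time the report is knitted.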

Back up the simplified data.

# row.names = FALSE keeps R's row index out of the saved file
write.csv(dsy_new, "./data/dsy_new.csv", row.names = FALSE)

Individual analysis and figures

Analysis and Figure 1

First of all, the number of reviews each Disneyland park receives from visitors of different nationalities gives an indication of how popular each park is.

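# Replace every character that is not a visible (graphical) character,
# such as stray whitespace or control characters, with a plain space
dsy_new <- dsy_new %>%
  mutate(Nationality = str_replace_all(Nationality, "[^[:graph:]]", " "))
dsy_new %>% count(Nationality)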
##               Nationality    n
## 1             Afghanistan    1
## 2               Argentina    3
## 3               Australia  460
## 4                 Austria    2
## 5                 Bahrain    3
## 6                 Belgium   11
## 7  Bosnia and Herzegovina    2
## 8                  Brazil    8
## 9                  Brunei    3
## 10               Bulgaria    2
## 11                 Canada  202
## 12                  Chile    2
## 13                  China   20
## 14               Colombia    1
## 15                Croatia    1
## 16                 Cyprus    4
## 17                Czechia    3
## 18                Denmark    5
## 19                  Egypt    6
## 20                Estonia    2
## 21               Ethiopia    1
## 22                Finland    8
## 23                 France   24
## 24       French Polynesia    1
## 25                Germany   23
## 26              Gibraltar    2
## 27                 Greece   10
## 28              Guatemala    2
## 29              Hong Kong   72
## 30                Hungary    3
## 31                  India  162
## 32              Indonesia   49
## 33                   Iran    4
## 34                Ireland   46
## 35            Isle of Man    2
## 36                 Israel    9
## 37                  Italy   18
## 38                  Japan    5
## 39             Kazakhstan    1
## 40                  Kenya    2
## 41                 Kuwait    5
## 42                   Laos    1
## 43                 Latvia    1
## 44                Lebanon    4
## 45              Lithuania    1
## 46             Luxembourg    1
## 47                  Macau    9
## 48               Malaysia   61
## 49               Maldives    1
## 50                   Mali    1
## 51                  Malta    8
## 52              Mauritius    2
## 53                 Mexico    8
## 54                 Monaco    1
## 55               Mongolia    1
## 56                Morocco    1
## 57        Myanmar (Burma)    2
## 58                Namibia    1
## 59            Netherlands   20
## 60            New Zealand   72
## 61                Nigeria    5
## 62                 Norway   10
## 63                   Oman    1
## 64               Pakistan    3
## 65                 Panama    1
## 66                   Peru    2
## 67            Philippines  129
## 68                 Poland    4
## 69               Portugal   14
## 70                  Qatar    5
## 71                Romania    8
## 72                 Russia    4
## 73           Saudi Arabia   12
## 74             Seychelles    1
## 75              Singapore  110
## 76           South Africa   16
## 77            South Korea    6
## 78                  Spain   15
## 79              Sri Lanka    7
## 80                 Sweden   12
## 81            Switzerland   13
## 82                 Taiwan    2
## 83               Thailand   24
## 84                 Turkey    5
## 85                Ukraine    1
## 86   United Arab Emirates   33
## 87         United Kingdom  894
## 88          United States 1266
## 89                Uruguay    2
## 90                Vanuatu    1
## 91                Vietnam    7
## 92               Zimbabwe    1
# There are too many countries to display clearly in one chart, so only the
# 20 countries with the most reviews for each park are kept (ties are kept,
# so a park may end up with more than 20).
dsy_1 <- dsy_new %>% 
  group_by(Disneyland) %>%
  count(Nationality, sort = TRUE) %>%
  slice_max(n, n = 20)

I’ll start with a bar chart that compares the number of reviews for the three Disney parks, broken down by visitor nationality and used here as a proxy for visitor numbers, to see which park visitors prefer.

p1 <- ggplot(data = dsy_1,
             mapping = aes(x = reorder_within(Nationality, n, Disneyland),
                           y = n,
                           fill = Disneyland))
p1 <- p1 + geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~Disneyland, scales = "free_y") +
  scale_x_reordered() +
  labs(x = NULL,
       y = "Number of visitors",
       title = "Number of visitors to three Disney parks",
       caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p1

ggsave("Pic1.png", plot = p1, path = "./image")
## Saving 7 x 5 in image

The chart shows that Disneyland in California is most popular with visitors from the United States, Disneyland in Hong Kong with visitors from Australia, and Disneyland in Paris with visitors from the United Kingdom. Among the three parks, Disneyland in California also has the largest number of visitors in this sample, which suggests it is the most popular overall.
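
The comparison between parks can also be checked with plain totals and shares rather than read off the chart. The sketch below uses a small made-up table (`toy` and its six rows are hypothetical) and assumes one row per review, as in `dsy_new`.

```r
library(dplyr)

# Hypothetical mini sample: one row per review
toy <- tibble(
  Disneyland  = c("Disneyland_California", "Disneyland_California",
                  "Disneyland_California", "Disneyland_Paris",
                  "Disneyland_Paris", "Disneyland_HongKong"),
  Nationality = c("United States", "United States", "Canada",
                  "United Kingdom", "France", "Australia")
)

# Reviews per park, largest first, with each park's share of the sample
park_totals <- toy %>%
  count(Disneyland, sort = TRUE) %>%
  mutate(share = n / sum(n))

# Share of each nationality within its park, so that parks with
# different numbers of reviews can be compared on the same scale
within_park <- toy %>%
  count(Disneyland, Nationality) %>%
  group_by(Disneyland) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

park_totals
within_park
```

Shares within a park are easier to compare than raw counts when one park has many more reviews than the others.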

Analysis and Figure 2

Although California Disneyland has the highest number of visitors among the three parks, popularity is only one aspect. Next, I analyze the visitors’ reviews to see whether they are mainly positive or negative.

The reviews that need to be analyzed are processed by breaking them into single words and removing the stop words.

rw_tidy <- dsy_new %>%
  group_by(Disneyland) %>%
  select(Disneyland, Review) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, Review) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`

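Use the Bing lexicon to label each word as positive or negative. Reviews are numbered within each park and grouped into blocks of 50 (index = linenumber %/% 50); the net sentiment of a block is its number of positive words minus its number of negative words.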

bing <- get_sentiments("bing")

rw_tidy %>%
  inner_join(bing) %>%
  count(Disneyland, index = linenumber %/% 50, sentiment)
## Joining with `by = join_by(word)`
## # A tibble: 164 × 4
##    Disneyland            index sentiment     n
##    <chr>                 <dbl> <chr>     <int>
##  1 Disneyland_California     0 negative    128
##  2 Disneyland_California     0 positive    197
##  3 Disneyland_California     1 negative     75
##  4 Disneyland_California     1 positive    208
##  5 Disneyland_California     2 negative    142
##  6 Disneyland_California     2 positive    235
##  7 Disneyland_California     3 negative    100
##  8 Disneyland_California     3 positive    190
##  9 Disneyland_California     4 negative    101
## 10 Disneyland_California     4 positive    162
## # ℹ 154 more rows
# Net sentiment per block of 50 reviews: positive minus negative word counts
rw_sentiment <- rw_tidy %>%
  inner_join(bing) %>%
  count(Disneyland, index = linenumber %/% 50, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
p2 <- ggplot(data = rw_sentiment,
             mapping = aes(x = index,
                           y = sentiment,
                           color = Disneyland)) +
  geom_line(linewidth = 0.7, show.legend = FALSE) + 
  scale_color_manual(values = c("skyblue3", "slategray3", "lightblue4")) +
  facet_wrap(~Disneyland, ncol = 2, scales = "free_x") +
  labs(x = NULL,
       y = "Sentiment",
       title = "Visitors' Reviews on the Sentiments of the Three Disney parks",
       caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p2

ggsave("Pic2.png", plot = p2, path = "./image")
## Saving 7 x 5 in image

The chart shows that sentiment is positive almost everywhere, so visitors’ evaluations of the three parks are almost all positive. The level of positivity differs slightly between the parks, however: Hong Kong Disneyland received the most positive reviews, followed by California and finally Paris.
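
The block index and net sentiment score behind this figure can be traced by hand on a tiny table. This is a minimal sketch with made-up values: `toy_words` is hypothetical and stands in for the result of joining the tokenized reviews with the lexicon.

```r
library(dplyr)
library(tidyr)

# Hypothetical word-level table: one row per sentiment-bearing word,
# with the number of the review (linenumber) it came from
toy_words <- tibble(
  linenumber = c(1, 1, 2, 49, 50, 51, 99, 100),
  sentiment  = c("positive", "negative", "positive", "positive",
                 "negative", "positive", "positive", "negative")
)

toy_sentiment <- toy_words %>%
  # Integer division: reviews 1-49 fall in block 0 (row numbers start
  # at 1), reviews 50-99 in block 1, reviews 100-149 in block 2, ...
  count(index = linenumber %/% 50, sentiment) %>%
  # One column per sentiment; blocks missing a sentiment get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment: positive word count minus negative word count
  mutate(sentiment = positive - negative)

toy_sentiment
```

Block 0 holds three positive words and one negative word, so its net score is 2; block 1 scores 1 and block 2 scores -1. Because the score is a raw difference of word counts, blocks with longer reviews can reach larger values even if the tone is similar.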

Analysis and Figure 3

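For the operation of the Disneyland parks, it is also very important to understand what visitors actually write about, so what are the main topics in the reviews of each park? To answer this, I split the reviews into bigrams (pairs of adjacent words), drop pairs that contain a stop word, and plot the pairs that occur more than 20 times for each park as a network.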
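
The code for this figure works with bigrams, that is, pairs of adjacent words. As a minimal sketch of the idea, the example below runs the same kind of steps on two made-up reviews; `toy_reviews` and its sentences are hypothetical and only serve as illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two hypothetical reviews
toy_reviews <- tibble(
  Disneyland = c("Disneyland_Paris", "Disneyland_HongKong"),
  Review     = c("We met Mickey Mouse near the castle",
                 "Mickey Mouse waved at the parade")
)

toy_bigrams <- toy_reviews %>%
  # Split each review into overlapping two-word sequences (lowercased)
  unnest_tokens(bigram, Review, token = "ngrams", n = 2) %>%
  # Put the two words of each bigram in their own columns
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop bigrams in which either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Count how often each remaining pair occurs, most frequent first
  count(word1, word2, sort = TRUE)

toy_bigrams
```

The pair "mickey mouse" appears in both reviews and therefore comes out on top, while pairs such as "at the" are removed by the stop-word filter.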

bigrams_separated <- dsy_new %>%
  group_by(Disneyland) %>%
  select(Disneyland, Review) %>%
  unnest_tokens(bigram, Review, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  ungroup() %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
 
bigrams_counts_1 <- bigrams_filtered %>%
  filter(Disneyland == "Disneyland_HongKong") %>%
  count(word1, word2, sort = TRUE)
bigrams_counts_2 <- bigrams_filtered %>%
  filter(Disneyland == "Disneyland_California") %>%
  count(word1, word2, sort = TRUE)
bigrams_counts_3 <- bigrams_filtered %>%
  filter(Disneyland == "Disneyland_Paris") %>%
  count(word1, word2, sort = TRUE)

# Keep only bigrams that occur more than 20 times and build one graph per park
bigrams_graph_1 <- bigrams_counts_1 %>%
  filter(n > 20) %>%
  graph_from_data_frame()
bigrams_graph_2 <- bigrams_counts_2 %>%
  filter(n > 20) %>%
  graph_from_data_frame()
bigrams_graph_3 <- bigrams_counts_3 %>%
  filter(n > 20) %>%
  graph_from_data_frame()

set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

p3_1 <- ggraph(bigrams_graph_1, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightpink", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "The Relationship Between the Words used in the Visitors' Reviews---Hong Kong",
       caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p3_1
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

p3_2 <- ggraph(bigrams_graph_2, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightsalmon", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "The Relationship Between the Words used in the Visitors' Reviews---California",
       caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p3_2

set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

p3_3 <- ggraph(bigrams_graph_3, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightgoldenrod", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() +
  labs(title = "The Relationship Between the Words used in the Visitors' Reviews---Paris",
       caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
  
p3_3

The three charts show that visitors’ comments on the three Disney parks are similar: all of them mention Disney characters, time, tickets and the facilities in the park. Mickey Mouse appears in the reviews of all three parks, which suggests that the character is very popular with visitors. These results can help inform the direction of Disneyland’s future operations.

ggsave("Pic3_1.jpg", plot = p3_1, path = "./image")
## Saving 7 x 5 in image
ggsave("Pic3_2.jpg", plot = p3_2, path = "./image")
## Saving 7 x 5 in image
ggsave("Pic3_3.jpg", plot = p3_3, path = "./image")
## Saving 7 x 5 in image