Let’s load the data. Because the original file is too large, the data was screened before import; after processing, 4264 rows remain.
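# Packages used throughout this analysis
library(tidyverse)  # dplyr, tidyr, stringr, ggplot2
library(tidytext)   # unnest_tokens, stop_words, get_sentiments, reorder_within
library(igraph)     # graph_from_data_frame
library(ggraph)     # network plots of bigrams
dsy <- read.csv("./data/DisneylandReviews.csv")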
As Disney keeps releasing new films, public attention to the brand keeps growing. At the same time, Disneyland parks have been established in several countries around the world. Although all three parks studied here are Disneyland parks, they seem to receive different reviews from visitors. This analysis asks three questions:

How popular are the Disneyland parks in different regions with visitors? (Number of visitors to the three Disney parks)
What do visitors from different countries say about the different Disneyland parks? (Negative/positive)
What are the main reviews of the different Disneyland parks? (Most frequent words/phrases used in comments)

I’m going to analyze comments from visitors to Disney parks in three different locations: Paris, California and Hong Kong. I think the results will be helpful for the future operation of Disneyland, since the parks can use visitors’ comments to make relevant improvements and offer a better experience.
The dataset includes about 42,000 reviews of 3 Disneyland branches (Paris, California and Hong Kong) posted by visitors on Trip Advisor. It contains the following columns: Review_ID (unique ID given to each review), Rating (ranging from 1 (unsatisfied) to 5 (satisfied)), Year_Month (when the reviewer visited the theme park), Reviewer_Location (country of origin of the visitor), Review_Text (comments made by the visitor) and Disneyland_Branch (location of the Disneyland park; this column is referred to as Branch in the code). These data can be used to analyze the preferences and real experiences of visitors from different countries at Disneyland parks in different regions. In total there are 42656 rows and 6 variables.
I will select the columns needed for the analysis and rename them for the subsequent steps.
# Select the columns I want to use and rename them
dsy_new <- select(dsy,
Disneyland = Branch,
Nationality = Reviewer_Location,
Review = Review_Text)
After the selection, the dataset is left with 3 variables.
Because the dataset is still relatively large, I will randomly sample 4000 rows for analysis.
# Draw 4000 rows at random, without replacement
dsy_new <- dsy_new %>%
  slice_sample(n = 4000, replace = FALSE)
After the selection, 4000 rows remain.
Back up the simplified data.
write.csv(dsy_new, "./data/dsy_new.csv" )
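The random sample above will differ on every run unless a seed is fixed first. A minimal sketch of a repeatable draw, assuming a small made-up table (`toy`) in place of the real reviews and an arbitrary seed value:

```r
library(dplyr)

# Toy stand-in for the review table (hypothetical values, not the real data)
toy <- tibble(
  Disneyland = rep(c("Disneyland_Paris", "Disneyland_HongKong"), each = 5),
  Review     = paste("review", 1:10)
)

# Fixing the seed before each draw makes the random sample repeatable
set.seed(2023)
sample_a <- slice_sample(toy, n = 4)

set.seed(2023)
sample_b <- slice_sample(toy, n = 4)

identical(sample_a, sample_b)  # the two draws contain the same rows
```

The same idea would apply to the 4000-row sample: calling `set.seed()` immediately before the sampling step should make the counts and charts reproducible.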
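First of all, the number of reviews each park receives from visitors of different nationalities reflects the popularity of the different parks.
# Replace whitespace and non-printing characters in the country names with a plain space
dsy_new <- dsy_new %>%
  mutate(Nationality = str_replace_all(Nationality, "[^[:graph:]]", " "))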
dsy_new %>% count(Nationality)
## Nationality n
## 1 Afghanistan 1
## 2 Argentina 3
## 3 Australia 460
## 4 Austria 2
## 5 Bahrain 3
## 6 Belgium 11
## 7 Bosnia and Herzegovina 2
## 8 Brazil 8
## 9 Brunei 3
## 10 Bulgaria 2
## 11 Canada 202
## 12 Chile 2
## 13 China 20
## 14 Colombia 1
## 15 Croatia 1
## 16 Cyprus 4
## 17 Czechia 3
## 18 Denmark 5
## 19 Egypt 6
## 20 Estonia 2
## 21 Ethiopia 1
## 22 Finland 8
## 23 France 24
## 24 French Polynesia 1
## 25 Germany 23
## 26 Gibraltar 2
## 27 Greece 10
## 28 Guatemala 2
## 29 Hong Kong 72
## 30 Hungary 3
## 31 India 162
## 32 Indonesia 49
## 33 Iran 4
## 34 Ireland 46
## 35 Isle of Man 2
## 36 Israel 9
## 37 Italy 18
## 38 Japan 5
## 39 Kazakhstan 1
## 40 Kenya 2
## 41 Kuwait 5
## 42 Laos 1
## 43 Latvia 1
## 44 Lebanon 4
## 45 Lithuania 1
## 46 Luxembourg 1
## 47 Macau 9
## 48 Malaysia 61
## 49 Maldives 1
## 50 Mali 1
## 51 Malta 8
## 52 Mauritius 2
## 53 Mexico 8
## 54 Monaco 1
## 55 Mongolia 1
## 56 Morocco 1
## 57 Myanmar (Burma) 2
## 58 Namibia 1
## 59 Netherlands 20
## 60 New Zealand 72
## 61 Nigeria 5
## 62 Norway 10
## 63 Oman 1
## 64 Pakistan 3
## 65 Panama 1
## 66 Peru 2
## 67 Philippines 129
## 68 Poland 4
## 69 Portugal 14
## 70 Qatar 5
## 71 Romania 8
## 72 Russia 4
## 73 Saudi Arabia 12
## 74 Seychelles 1
## 75 Singapore 110
## 76 South Africa 16
## 77 South Korea 6
## 78 Spain 15
## 79 Sri Lanka 7
## 80 Sweden 12
## 81 Switzerland 13
## 82 Taiwan 2
## 83 Thailand 24
## 84 Turkey 5
## 85 Ukraine 1
## 86 United Arab Emirates 33
## 87 United Kingdom 894
## 88 United States 1266
## 89 Uruguay 2
## 90 Vanuatu 1
## 91 Vietnam 7
## 92 Zimbabwe 1
# There are too many countries to display clearly in one chart, so only the
# 20 countries with the most reviews are kept for each park.
dsy_1 <- dsy_new %>%
  group_by(Disneyland) %>%
  count(Nationality, sort = TRUE) %>%
  slice_max(n, n = 20)
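One detail worth knowing about `slice_max()` is that it keeps tied rows by default, so a group can return more rows than requested. A small sketch with made-up counts (`toy_counts` is hypothetical) showing the difference that `with_ties = FALSE` makes:

```r
library(dplyr)

# Hypothetical per-park counts; park "A" has a tie for second place
toy_counts <- tibble(
  Disneyland  = c("A", "A", "A", "A", "B", "B"),
  Nationality = c("w", "x", "y", "z", "w", "x"),
  n           = c(10, 5, 5, 1, 3, 2)
)

# Default behaviour: ties are kept, so park "A" returns three rows
top2_ties <- toy_counts %>%
  group_by(Disneyland) %>%
  slice_max(n, n = 2) %>%
  ungroup()

# with_ties = FALSE: exactly two rows per park
top2_exact <- toy_counts %>%
  group_by(Disneyland) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

nrow(top2_ties)   # 5
nrow(top2_exact)  # 4
```

In the real data many countries have only one or two reviews, so ties near the cut-off are likely and some panels may show more than 20 bars.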
I’ll start by comparing the number of reviews, used here as a proxy for the number of visitors, for the Disney parks in the three regions with a bar chart to see which park visitors prefer.
p1 <- ggplot(data = dsy_1,
mapping = aes(x = reorder_within(Nationality, n, Disneyland),
y = n,
fill = Disneyland))
p1 <- p1 + geom_col(show.legend = F) +
coord_flip() +
facet_wrap(~Disneyland, scales = "free_y") +
scale_x_reordered() +
labs(x = NULL,
y = "Visitors Numbers",
title = "Number of visitors to three Disney parks",
caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p1
ggsave("Pic1.png", plot = p1, path = "./image")
## Saving 7 x 5 in image
As the chart above shows, Disneyland in California is most popular with visitors from the United States, Disneyland in Hong Kong is most popular with visitors from Australia, and Disneyland in Paris is most popular with visitors from the United Kingdom. Among the three parks, Disneyland in California has the largest number of reviews in this sample, which suggests it is the most popular with tourists.
Although California Disneyland has the highest number of visitors among the three parks, other aspects also need to be considered. Next, I’m going to analyze the visitors’ reviews to see whether they are positive or negative.
The reviews are prepared for analysis by breaking them into single words and removing the stop words.
rw_tidy <- dsy_new %>%
group_by(Disneyland) %>%
select(Disneyland, Review) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, Review) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
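To make the tokenization step concrete, here is a minimal sketch on two invented reviews (`toy_reviews` is hypothetical). It uses the same `unnest_tokens()` and `anti_join()` calls as above, with `by = "word"` stated explicitly so that no join message is printed:

```r
library(dplyr)
library(tidytext)

# Two made-up reviews, not taken from the dataset
toy_reviews <- tibble(
  Disneyland = c("Disneyland_Paris", "Disneyland_HongKong"),
  Review     = c("The queues were long", "We loved the parade")
)

toy_tidy <- toy_reviews %>%
  unnest_tokens(word, Review) %>%       # one lower-cased word per row
  anti_join(stop_words, by = "word")    # drop common words such as "the"

toy_tidy
```

Each remaining row keeps its `Disneyland` label, which is what allows the later sentiment counts to be split by park.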
Use the Bing sentiment lexicon to measure how positive or negative the reviews are.
bing <- get_sentiments("bing")
rw_tidy %>%
inner_join(bing) %>%
count(Disneyland, index = linenumber %/% 50,sentiment)
## Joining with `by = join_by(word)`
## # A tibble: 164 × 4
## Disneyland index sentiment n
## <chr> <dbl> <chr> <int>
## 1 Disneyland_California 0 negative 128
## 2 Disneyland_California 0 positive 197
## 3 Disneyland_California 1 negative 75
## 4 Disneyland_California 1 positive 208
## 5 Disneyland_California 2 negative 142
## 6 Disneyland_California 2 positive 235
## 7 Disneyland_California 3 negative 100
## 8 Disneyland_California 3 positive 190
## 9 Disneyland_California 4 negative 101
## 10 Disneyland_California 4 positive 162
## # ℹ 154 more rows
rw_sentiment <- rw_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(Disneyland, index = linenumber %/% 50,sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
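The net sentiment score is built in two steps: `linenumber %/% 50` groups reviews into blocks of roughly 50, and `positive - negative` gives the net score per block. A small sketch on hypothetical, already-scored words (`toy_scored` is invented for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical matched words: one row per word found in the lexicon
toy_scored <- tibble(
  Disneyland = "Disneyland_Paris",
  linenumber = c(1, 20, 49, 50, 75, 120),
  sentiment  = c("positive", "negative", "positive",
                 "positive", "negative", "positive")
)

toy_net <- toy_scored %>%
  # Integer division: reviews 1-49 -> block 0, 50-99 -> block 1, 100-149 -> block 2
  count(Disneyland, index = linenumber %/% 50, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_net$sentiment  # 1 0 1
```

Because `row_number()` starts at 1, block 0 holds 49 reviews rather than 50; this slightly lowers the first point of each line but should not change the overall picture.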
p2 <- ggplot(data = rw_sentiment,
mapping = aes(x = index,
y = sentiment,
color = Disneyland)) +
geom_line(linewidth = 0.7, show.legend = FALSE) +
scale_color_manual(values = c("skyblue3", "slategray3", "lightblue4")) +
facet_wrap(~Disneyland, ncol = 2, scales = "free_x") +
labs(x = NULL,
y = "Sentiment",
title = "Visitors' Reviews on the Sentiments of the Three Disney parks",
caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p2
ggsave("Pic2.png", plot = p2, path = "./image")
## Saving 7 x 5 in image
The chart above shows that the net sentiment is positive in almost every block of reviews, so visitors’ evaluations of the three parks are almost all positive. The level of positivity differs slightly between the parks, however: Hong Kong Disneyland received the most positive reviews, followed by California and finally Paris.
For the operation of the Disneyland parks, it is also important to understand what visitors actually write about. So what are the main topics in the reviews of the different Disneyland parks?
bigrams_separated <- dsy_new %>%
group_by(Disneyland) %>%
select(Disneyland, Review) %>%
unnest_tokens(bigram, Review, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) %>%
ungroup() %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_counts_1 <- bigrams_filtered %>%
filter(Disneyland == "Disneyland_HongKong") %>%
count(word1, word2, sort = TRUE)
bigrams_counts_2 <- bigrams_filtered %>%
filter(Disneyland == "Disneyland_California") %>%
count(word1, word2, sort = TRUE)
bigrams_counts_3 <- bigrams_filtered %>%
filter(Disneyland == "Disneyland_Paris") %>%
count(word1, word2, sort = TRUE)
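The bigram steps can be checked on a tiny example. The sketch below runs the same tokenize, separate, filter and count sequence on two invented sentences (`toy` is hypothetical), so the effect of the stop-word filter is easy to see:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two made-up reviews, not taken from the dataset
toy <- tibble(Review = c("Mickey Mouse was at the parade",
                         "we met Mickey Mouse"))

toy_bigrams <- toy %>%
  unnest_tokens(bigram, Review, token = "ngrams", n = 2) %>%  # overlapping word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Keep a pair only if neither word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

toy_bigrams
```

Pairs such as "mouse was" and "the parade" are dropped because one of the two words is a stop word, while "mickey mouse" survives and is counted twice.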
bigrams_graph_1 <- bigrams_counts_1%>%
filter(n > 20) %>%
graph_from_data_frame()
bigrams_graph_2 <- bigrams_counts_2%>%
filter(n > 20) %>%
graph_from_data_frame()
bigrams_graph_3 <- bigrams_counts_3%>%
filter(n > 20) %>%
graph_from_data_frame()
set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
p3_1 <- ggraph(bigrams_graph_1, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightpink", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void() +
labs(title = "The Relationship Between the Words used in the Visitors' Reviews---Hong Kong",
caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p3_1
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
p3_2 <- ggraph(bigrams_graph_2, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightsalmon", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void() +
labs(title = "The Relationship Between the Words used in the Visitors' Reviews---California",
caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p3_2
set.seed(2023)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
p3_3 <- ggraph(bigrams_graph_3, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightgoldenrod", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void() +
labs(title = "The Relationship Between the Words used in the Visitors' Reviews---Paris",
caption = "DATA: 42,000 reviews of 3 Disneyland branches - Paris, California and Hong Kong")
p3_3
The three charts above show that visitors’ comments on the three Disney parks are similar: they all mention Disney characters, time, tickets and related facilities in the park. Mickey Mouse appears in the reviews of all three parks, which suggests the character is very popular with visitors. These results should also help inform the direction of Disneyland’s future operations.
ggsave("Pic3_1.jpg", plot = p3_1, path = "./image")
## Saving 7 x 5 in image
ggsave("Pic3_2.jpg", plot = p3_2, path = "./image")
## Saving 7 x 5 in image
ggsave("Pic3_3.jpg", plot = p3_3, path = "./image")
## Saving 7 x 5 in image
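The three network plots differ only in the graph, the node color and the park name in the title, so they could be produced by one function. The following is only a sketch: `plot_bigram_network()` and its arguments are hypothetical names, and `toy_graph` is a made-up two-edge graph used to show the call.

```r
library(ggplot2)
library(igraph)
library(ggraph)

# Hypothetical helper wrapping the plotting code used for p3_1, p3_2 and p3_3
plot_bigram_network <- function(graph, node_color, park_label, seed = 2023) {
  set.seed(seed)  # the "fr" layout is random, so fix the seed for a stable picture
  a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
  ggraph(graph, layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                   arrow = a, end_cap = circle(.07, "inches")) +
    geom_node_point(color = node_color, size = 5) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    theme_void() +
    labs(title = paste0("The Relationship Between the Words used in the Visitors' Reviews---",
                        park_label))
}

# Made-up bigram counts, only to demonstrate the call
toy_graph <- graph_from_data_frame(
  data.frame(word1 = c("mickey", "space"),
             word2 = c("mouse", "mountain"),
             n     = c(30, 25))
)

p_toy <- plot_bigram_network(toy_graph, "lightpink", "Hong Kong")
```

With such a helper, each of the three charts would become a single call, for example `plot_bigram_network(bigrams_graph_1, "lightpink", "Hong Kong")`.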