About Disney? Disneyland Park is most certainly a land of enchantment where children and the young at heart find that dreams really do come true. Disneyland Park is a seamless blend of yesterday, today and tomorrow. Smiling train conductors, marching bands and the clip-clop of horse-drawn carriages bring you back to carefree days. You can also rocket through the galaxy, board a runaway train or trek through the jungle. Laughter is always in the air, with friendly smiles all around. The secret to Disneyland is its ability to change yet remain the same. As Walt Disney said, “Disneyland will never be completed. It will continue to grow as long as there is imagination left in the world.”
We aim to derive insights about Disneyland from this dataset, analyze the reviews to draw meaningful conclusions, and conduct sentiment analysis on the visitor feedback.
Column Description:
# library used
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
data = "/Users/richarddiaz/Desktop/DisneylandReviews.csv"
df=pd.read_csv(data,encoding="cp1252")
df
| | Review_ID | Rating | Year_Month | Reviewer_Location | Review_Text | Branch |
|---|---|---|---|---|---|---|
| 0 | 670772142 | 4 | 2019-4 | Australia | If you’ve ever been to Disneyland anywhere you… | Disneyland_HongKong |
| 1 | 670682799 | 4 | 2019-5 | Philippines | Its been a while since d last time we visit HK… | Disneyland_HongKong |
| 2 | 670623270 | 4 | 2019-4 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh… | Disneyland_HongKong |
| 3 | 670607911 | 4 | 2019-4 | Australia | HK Disneyland is a great compact park. Unfortu… | Disneyland_HongKong |
| 4 | 670607296 | 4 | 2019-4 | United Kingdom | the location is not in the city, took around 1… | Disneyland_HongKong |
| … | … | … | … | … | … | … |
| 42651 | 1765031 | 5 | missing | United Kingdom | i went to disneyland paris in july 03 and thou… | Disneyland_Paris |
| 42652 | 1659553 | 5 | missing | Canada | 2 adults and 1 child of 11 visited Disneyland … | Disneyland_Paris |
| 42653 | 1645894 | 5 | missing | South Africa | My eleven year old daughter and myself went to… | Disneyland_Paris |
| 42654 | 1618637 | 4 | missing | United States | This hotel, part of the Disneyland Paris compl… | Disneyland_Paris |
| 42655 | 1536786 | 4 | missing | United Kingdom | I went to the Disneyparis resort, in 1996, wit… | Disneyland_Paris |
42656 rows × 6 columns
## inspecting the rows whose Year_Month is recorded as 'missing'
df=df.loc[df['Year_Month']=='missing']
df
| | Review_ID | Rating | Year_Month | Reviewer_Location | Review_Text | Branch |
|---|---|---|---|---|---|---|
| 269 | 647038712 | 4 | missing | Philippines | The first thing on our agenda when we finished… | Disneyland_HongKong |
| 282 | 646466731 | 3 | missing | Singapore | Brought mum for the first time to Disneyland w… | Disneyland_HongKong |
| 622 | 620580249 | 3 | missing | Canada | I have been to Tokyo and LA Disneyland!I also … | Disneyland_HongKong |
| 5347 | 318799221 | 4 | missing | Australia | We pre bought tickets at the hotel (same price… | Disneyland_HongKong |
| 5799 | 284745152 | 5 | missing | Philippines | Disneyland is indeed the most magical place in… | Disneyland_HongKong |
| … | … | … | … | … | … | … |
| 42651 | 1765031 | 5 | missing | United Kingdom | i went to disneyland paris in july 03 and thou… | Disneyland_Paris |
| 42652 | 1659553 | 5 | missing | Canada | 2 adults and 1 child of 11 visited Disneyland … | Disneyland_Paris |
| 42653 | 1645894 | 5 | missing | South Africa | My eleven year old daughter and myself went to… | Disneyland_Paris |
| 42654 | 1618637 | 4 | missing | United States | This hotel, part of the Disneyland Paris compl… | Disneyland_Paris |
| 42655 | 1536786 | 4 | missing | United Kingdom | I went to the Disneyparis resort, in 1996, wit… | Disneyland_Paris |
2613 rows × 6 columns
# the rows above use the literal string 'missing', so re-read the file and treat that string as NaN
df=pd.read_csv(data,encoding="cp1252",na_values=['missing'])
## checking for null values
df=df.dropna().reset_index()
print ("\nMissing values : ", df.isnull().sum().values.sum())
Missing values : 0
## dropping any duplicate entries
df.drop_duplicates(subset='Review_Text', inplace=True, keep='first')
## data summary
print ("Rows : " ,df.shape[0])
print ("Columns : " ,df.shape[1])
print ("\nFeatures : \n" ,df.columns.tolist())
print ("\nMissing values : ", df.isnull().sum().values.sum())
print ("\nUnique values : \n",df.nunique())
Rows : 40022
Columns : 7
Features :
['index', 'Review_ID', 'Rating', 'Year_Month', 'Reviewer_Location', 'Review_Text', 'Branch']
Missing values : 0
Unique values :
index 40022
Review_ID 40014
Rating 5
Year_Month 111
Reviewer_Location 162
Review_Text 40022
Branch 3
dtype: int64
The dataset consists of 40,022 rows and 7 columns with no missing values. Every row has distinct review text (Review_Text has 40,022 unique values), although Review_ID has only 40,014 unique values, so a few IDs repeat. The reviews cover three branches, are rated on a 5-point scale (Rating), come from 162 different reviewer locations, and span 111 distinct year-month periods (Year_Month).
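The unique-value counts above list 40,014 distinct Review_ID values against 40,022 rows, so a handful of IDs appear more than once. A minimal sketch of how such rows could be surfaced, using a toy frame with made-up values rather than the real data:

```python
import pandas as pd

# Toy data (hypothetical values): two rows share a Review_ID but differ in text.
toy = pd.DataFrame({
    "Review_ID": [101, 102, 102, 103],
    "Review_Text": ["great", "too busy", "too busy!", "fun"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dup_mask = toy.duplicated(subset="Review_ID", keep=False)
repeated_ids = toy.loc[dup_mask].sort_values("Review_ID")
print(repeated_ids)
```

On the real frame the same call would be `df.duplicated(subset='Review_ID', keep=False)`; whether the repeated IDs are genuine re-posts or edits is something only inspection of those rows can tell.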
#review per branch
df['Branch'].value_counts()
Disneyland_California 18196
Disneyland_Paris 12691
Disneyland_HongKong 9135
Name: Branch, dtype: int64
The data shows the number of reviews for each Disneyland branch: 18,196 for Disneyland California, 12,691 for Disneyland Paris, and 9,135 for Disneyland Hong Kong.
Next, we split Year_Month into separate year and month columns so that we can derive a quarter feature. This mimics quarterly reporting of park sentiment: much as quarterly earnings provide a benchmark of how well the parks are doing, quarterly review summaries can show where to improve and which reviews to highlight.
# new data frame with split value columns
new = df["Year_Month"].str.split("-", n = 1, expand = True)
# making separate year column from new data frame
df["year"]= new[0]
# making separate month column from new data frame
df["month"]= new[1]
# Dropping old feature
df.drop(columns =["Year_Month"], inplace = True)
# Keep month as integer value
df['month']=df['month'].astype('int64')
# Quarterly extract: the lambda maps each month to its quarter with chained if/else expressions
df['quarter']=df['month'].apply(lambda x:1 if x<=3 else (2 if 3<x<=6 else (3 if 6<x<=9 else (4 if 9<x<=12 else x))))
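The chained conditional above can also be written as integer arithmetic. The sketch below is an alternative, not the notebook's method; `quarter_from_month` is a hypothetical helper name, and the check confirms that both forms agree for months 1 to 12:

```python
def quarter_from_month(month: int) -> int:
    """Map a calendar month (1-12) to its quarter (1-4)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return (month - 1) // 3 + 1


def quarter_chained(x: int) -> int:
    """The chained conditional used in the notebook, kept for comparison."""
    return 1 if x <= 3 else (2 if 3 < x <= 6 else (3 if 6 < x <= 9 else (4 if 9 < x <= 12 else x)))


quarters = {m: quarter_from_month(m) for m in range(1, 13)}
agrees = all(quarters[m] == quarter_chained(m) for m in range(1, 13))
print(quarters, agrees)
```

On a pandas integer column the same arithmetic is vectorized, for example `(df['month'] - 1) // 3 + 1`, which avoids a Python-level `apply`.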
## review text of array
df['Review_Text'].values[2:3]
array(['Thanks God it wasn t too hot or too humid when I was visiting the park otherwise it would be a big issue (there is not a lot of shade).I have arrived around 10:30am and left at 6pm. Unfortunately I didn t last until evening parade, but 8.5 hours was too much for me.There is plenty to do and everyone will find something interesting for themselves to enjoy.It wasn t extremely busy and the longest time I had to queue for certain attractions was 45 minutes (which is really not that bad).Although I had an amazing time, I felt a bit underwhelmed with choice of rides and attractions. The park itself is quite small (I was really expecting something grand even the main castle which was closed by the way was quite small).The food options are good, few coffee shops (including Starbucks) and plenty of gift shops. There was no issue with toilets as they are everywhere.All together it was a great day out and I really enjoyed it.'],
dtype=object)
# review length of array
df['review length']=df['Review_Text'].apply(lambda x:len(x))
#pip install wordcloud
## recapping library used:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS , ImageColorGenerator
import re
from PIL import Image
#==
features = df['Review_Text'].values
#===
processed_features = []
for sentence in range(0, len(features)):
    # Remove all URLs (http or https)
    processed_feature = re.sub(r'(https?://\S+)', '', str(features[sentence]))
    # Replace every non-word character with a space
    processed_feature = re.sub(r'\W', ' ', processed_feature)
    # Remove single characters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # Remove a single character at the start of the text
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)
    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # Remove a prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # Convert to lowercase
    processed_feature = processed_feature.lower()
    processed_features.append(processed_feature)
# creating new dataframe for wordcloud
df3=pd.DataFrame()
df3['reviews']=processed_features
#pip install textblob
from textblob import TextBlob
from wordcloud import WordCloud
# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity
# Create a function to get the polarity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity
# Create two new columns 'Subjectivity' & 'Polarity'
df3['Subjectivity'] = df3['reviews'].apply(getSubjectivity)
df3['Polarity'] = df3['reviews'].apply(getPolarity)
# Score determination for each review
def getAnalysis(score):
    if score < 0: # if score is less than zero -> negative
        return 'Negative'
    elif score == 0: # if score is exactly zero -> neutral
        return 'Neutral'
    else: # otherwise -> positive
        return 'Positive'
# store the label for each review in a new column
df3['Analysis'] = df3['Polarity'].apply(getAnalysis)
df3
| | reviews | Subjectivity | Polarity | Analysis |
|---|---|---|---|---|
| 0 | if you ve ever been to disneyland anywhere you… | 0.561481 | 0.239352 | Positive |
| 1 | its been while since last time we visit hk dis… | 0.459783 | 0.205797 | Positive |
| 2 | thanks god it wasn too hot or too humid when w… | 0.434857 | 0.119238 | Positive |
| 3 | hk disneyland is great compact park unfortunat… | 0.512143 | 0.189286 | Positive |
| 4 | the location is not in the city took around 1 … | 0.437500 | 0.266667 | Positive |
| … | … | … | … | … |
| 40017 | although our pick up was prompt the taxi drive… | 0.470556 | 0.034402 | Positive |
| 40018 | just returned from 4 days family trip to disne… | 0.437991 | 0.202937 | Positive |
| 40019 | we spent the 20 dec 2010 in the disney park an… | 0.493521 | 0.020628 | Positive |
| 40020 | well was really looking forward to this trip o… | 0.497893 | 0.125890 | Positive |
| 40021 | if staying at disney hotel make good use of yo… | 0.445000 | 0.115000 | Positive |
40022 rows × 4 columns
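The labels above come from the sign of the polarity score, and a later step in this notebook overrides them with 'Negative' whenever the star rating is below 3. A plain-Python sketch of that combined rule, using hypothetical helper names and made-up (rating, polarity) pairs:

```python
def polarity_label(score: float) -> str:
    """Label a polarity score by its sign, as getAnalysis does."""
    if score < 0:
        return "Negative"
    if score == 0:
        return "Neutral"
    return "Positive"


def final_sentiment(rating: int, polarity: float) -> str:
    """Ratings below 3 are always Negative; otherwise use the polarity label."""
    return "Negative" if rating < 3 else polarity_label(polarity)


# Made-up examples: (rating, polarity)
examples = [(5, 0.24), (2, 0.30), (4, 0.0), (3, -0.1)]
labels = [final_sentiment(r, p) for r, p in examples]
print(labels)
```

The second example shows why the override matters: a 2-star review with mildly positive wording would otherwise be counted as Positive.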
# sentiment analysis, ratings, and reviews
# df keeps gaps in its index after drop_duplicates, while df3 is numbered 0..n-1,
# so assign by position (to_numpy) rather than by index label
df['sentiment']=df3['Analysis'].to_numpy() # polarity-based label
df['Sentiment']=df['Rating'].apply(lambda x:'Negative' if x<3 else np.nan) # ratings below 3 are always Negative
df['Sentiment']=df['Sentiment'].fillna(df['sentiment']) # otherwise keep the polarity-based label
df['Reviews_Text']=df3['reviews'].to_numpy()
# drop the helper sentiment column and Review_ID, since they are not needed for now
df=df.drop(['sentiment','Review_ID'],axis=1)
df_copy=df.copy()
df.head()
| | index | Rating | Reviewer_Location | Review_Text | Branch | year | month | quarter | review length | Sentiment | Reviews_Text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | Australia | If you’ve ever been to Disneyland anywhere you… | Disneyland_HongKong | 2019 | 4 | 2 | 329 | Positive | if you ve ever been to disneyland anywhere you… |
| 1 | 1 | 4 | Philippines | Its been a while since d last time we visit HK… | Disneyland_HongKong | 2019 | 5 | 2 | 970 | Positive | its been while since last time we visit hk dis… |
| 2 | 2 | 4 | United Arab Emirates | Thanks God it wasn t too hot or too humid wh… | Disneyland_HongKong | 2019 | 4 | 2 | 938 | Positive | thanks god it wasn too hot or too humid when w… |
| 3 | 3 | 4 | Australia | HK Disneyland is a great compact park. Unfortu… | Disneyland_HongKong | 2019 | 4 | 2 | 485 | Positive | hk disneyland is great compact park unfortunat… |
| 4 | 4 | 4 | United Kingdom | the location is not in the city, took around 1… | Disneyland_HongKong | 2019 | 4 | 2 | 163 | Positive | the location is not in the city took around 1 … |
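One thing worth checking when copying columns from `df3` into `df`: `df` was de-duplicated after its index was reset, so its index labels can have gaps, whereas `df3` is numbered from 0. pandas aligns on index labels when a Series is assigned as a column, which can shift or drop values unless the assignment is done by position. A toy sketch with made-up data (the frame names `left` and `right` are illustrative only):

```python
import pandas as pd

# Toy frame whose index has a gap (labels 0, 2, 3), as can happen after drop_duplicates.
left = pd.DataFrame({"text": ["a", "b", "c"]}, index=[0, 2, 3])
# Derived frame built from left's values, so it gets a fresh 0..2 index.
right = pd.DataFrame({"label": ["A", "B", "C"]})

aligned = left.copy()
aligned["label"] = right["label"]              # matched on index labels

positional = left.copy()
positional["label"] = right["label"].to_numpy()  # matched on row position

print(aligned)
print(positional)
```

In the label-aligned version the second row receives "C" and the third receives NaN, while the positional version keeps the intended one-to-one pairing.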
import seaborn as sns
import matplotlib.pyplot as plt
# Set the aesthetics for the plot
sns.set_style("whitegrid")
sns.set_context("talk") # Larger font size
# Group the data and sum the review lengths by year
df3 = df.groupby('year', as_index=False).agg({'review length': 'sum'})
# Create the plot
plt.figure(figsize=(14, 7))
plt.plot(df3['year'], df3['review length'], marker='o', linestyle='-', label='Review Length by Year', color='blue')
# Add title and labels
plt.title('Review Length Summarized by Year', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Total Review Length', fontsize=16)
# Show grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
# Add legend
plt.legend(title='Legend', title_fontsize='13', fontsize='12', loc='upper left')
# Show the plot
plt.show()
We can see that total review length increases up to about 2015 and decreases afterwards.
# Set the figure size
plt.figure(figsize=(10, 6))
# Create a countplot
sns.countplot(data=df, x='year', hue='Branch', palette='Blues')
# Calculate the moving average
window_size = 3 # Define the window size for the moving average
# Calculate the moving average for each branch
for branch in df['Branch'].unique():
    branch_data = df[df['Branch'] == branch]
    yearly_counts = branch_data.groupby('year').size().rolling(window=window_size).mean()
    plt.plot(yearly_counts.index, yearly_counts, label=f'{branch} MA', marker='o')
# Add labels
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Yearly Count by Branch with Moving Average')
# Add legend
plt.legend(fontsize=8) # adjust here legend size
# Show the plot
plt.show()
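The moving average in the plot above is a 3-year rolling mean of yearly review counts. A small sketch with made-up counts shows what `rolling(window=3).mean()` returns, including the NaN values for the first two years, and how `min_periods` changes that:

```python
import pandas as pd

# Made-up yearly counts for illustration only.
yearly_counts = pd.Series([10, 20, 30, 40], index=["2016", "2017", "2018", "2019"])

# Full windows only: the first window_size - 1 entries are NaN.
smoothed = yearly_counts.rolling(window=3).mean()

# Allow partial windows so every year gets a value.
smoothed_partial = yearly_counts.rolling(window=3, min_periods=1).mean()

print(smoothed)
print(smoothed_partial)
```

With full windows only, each branch's moving-average line starts two years after its first year of data, which is worth keeping in mind when reading the chart.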
# Create a figure with two subplots side-by-side and set a larger overall figure size for better readability
fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=False)
# Plotting the first subplot: Quarter by Ratings
sns.countplot(ax=axes[0], data=df, x='quarter', hue='Rating', palette='Set2')
axes[0].set_title('Quarter By Ratings', fontsize=16)
axes[0].set_xlabel('Quarter', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].grid(True, which='both', linestyle='--', linewidth=0.5)
axes[0].legend(title='Rating', title_fontsize=12, fontsize=10)
# Plotting the second subplot: Quarter per reviews
sns.countplot(ax=axes[1], data=df, x='quarter', palette='coolwarm')
axes[1].set_title('Quarter per Reviews', fontsize=16)
axes[1].set_xlabel('Quarter', fontsize=14)
axes[1].set_ylabel('Count', fontsize=14)
axes[1].grid(True, which='both', linestyle='--', linewidth=0.5)
# Adjust the layout to prevent overlap and ensure clarity
plt.tight_layout()
# Show the plot
plt.show()
From a quarterly perspective, review counts are lowest in the first quarter.
df4 = df.groupby(['Reviewer_Location'],as_index=False).agg({'Rating':'mean'}).sort_values(by='Rating', ascending=True).head(10)
df4.style.background_gradient(cmap="autumn", subset=['Rating'])
| | Reviewer_Location | Rating |
|---|---|---|
| 3 | Andorra | 2.000000 |
| 147 | Turks and Caicos Islands | 2.000000 |
| 132 | South Sudan | 2.000000 |
| 136 | Suriname | 2.000000 |
| 39 | Ecuador | 2.333333 |
| 58 | Haiti | 3.000000 |
| 129 | Solomon Islands | 3.000000 |
| 107 | Northern Mariana Islands | 3.000000 |
| 36 | Democratic Republic of the Congo | 3.000000 |
| 128 | Slovenia | 3.000000 |
df4 = df.groupby(['Reviewer_Location'], as_index=False).agg({'Rating': 'mean'}).sort_values(by='Rating', ascending=False).head(10)
df4.style.background_gradient(cmap="autumn", subset=['Rating'])
| | Reviewer_Location | Rating |
|---|---|---|
| 81 | Libya | 5.000000 |
| 23 | Caribbean Netherlands | 5.000000 |
| 31 | Cuba | 5.000000 |
| 32 | Curaçao | 5.000000 |
| 43 | Ethiopia | 5.000000 |
| 44 | Falkland Islands (Islas Malvinas) | 5.000000 |
| 49 | Georgia | 5.000000 |
| 54 | Grenada | 5.000000 |
| 66 | Iraq | 5.000000 |
| 89 | Mali | 5.000000 |
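Mean ratings of exactly 2.0 or 5.0 in the two tables above may simply reflect locations with very few reviews; that is an assumption, not something the tables confirm. One way to check is to report the review count next to the mean and filter on it. The sketch below uses made-up data, and `MIN_REVIEWS` is a hypothetical cutoff:

```python
import pandas as pd

# Toy data (hypothetical values): location A has a single review.
toy = pd.DataFrame({
    "Reviewer_Location": ["A", "B", "B", "B", "C", "C"],
    "Rating": [5, 4, 5, 3, 2, 4],
})

MIN_REVIEWS = 2  # hypothetical cutoff; tune for the real data

# Mean rating and number of reviews per location.
summary = (
    toy.groupby("Reviewer_Location")["Rating"]
    .agg(mean_rating="mean", n_reviews="count")
    .reset_index()
)

# Keep only locations with enough reviews before ranking them.
reliable = summary[summary["n_reviews"] >= MIN_REVIEWS].sort_values(
    "mean_rating", ascending=False
)
print(reliable)
```

Here location A, with a perfect score from one review, drops out of the ranking once the count filter is applied.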
# Set the plot style
sns.set_style("darkgrid")
# Initialize the figure
plt.figure(figsize=(14, 7))
# Define colors for the plot lines
colors = plt.cm.viridis(np.linspace(0, 1, 5))
# Loop through the rating values to aggregate and plot data
for i, color in zip(range(1, 6), colors):
    subset = df.loc[df['Rating'] == i]
    grouped = subset.groupby('year', as_index=False).agg({'Rating': 'sum'})
    plt.plot(grouped['year'], grouped['Rating'], marker='o', label=str(i), color=color)
# Add legend, labels, and title
plt.legend(title='Rating')
plt.xlabel('Year')
plt.ylabel('Sum of Ratings')
plt.title('Sum of Ratings by Year')
# Show the plot
plt.show()
plt.figure(figsize=(10,6))
sns.countplot(data=df,x='Branch',hue='Sentiment',palette='inferno');
# Define the branches to analyze
branches = ['Disneyland_California', 'Disneyland_Paris', 'Disneyland_HongKong']
colors = ['#228B22', '#CC0000', '#00BFFF'] # Define a common color scheme for the pie charts
# Initialize the subplot
fig, axes = plt.subplots(1, 3, figsize=(20, 7))
fig.suptitle('Branches Sentiment Distribution')
# Define text properties for pie chart labels to make them bold
textprops = {"weight": "bold"} # making the text bold
# Loop through each branch and create the pie charts
for ax, branch in zip(axes, branches):
    # Filter the data frame by branch and get the sentiment counts
    sentiment_counts = df.loc[df['Branch'] == branch, 'Sentiment'].value_counts(sort=True)
    labels = sentiment_counts.index
    sizes = sentiment_counts.values
    # Create the pie chart for each branch
    ax.pie(sizes, labels=labels, startangle=90, shadow=True, autopct='%1.2f%%', colors=colors, textprops=textprops)
    ax.set_title(branch.split('_')[1]) # Set the title to the branch name
plt.show()
# Convert the 'Reviews_Text' column to a single string
df['Reviews_Text'] = df['Reviews_Text'].astype('str')
reviews_text = " ".join(txt for txt in df['Reviews_Text'])
# Create a WordCloud object without an image mask
wc = WordCloud(background_color='white',
mode='RGB', width=1000, max_words=1000, height=1000,
random_state=1, contour_width=1, contour_color='black', colormap='flag')
# Generate the word cloud
wc.generate(reviews_text)
# Display the word cloud
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.tight_layout(pad=0)
plt.axis('off') # Remove the axes
plt.show()
There are many variables one could use to try to predict reviews. From a holistic, macro perspective, it is helpful to have an overall picture of how the parks are doing. Nevertheless, there are some good reviews centered on food and characters. The word cloud could have been more detailed, but it does give a good overall sense of the important themes and ideas to start looking into.
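As a possible next step toward a more detailed view than the word cloud, word frequencies can be counted directly after removing common filler words. This is a sketch on made-up review snippets; `top_words` is a hypothetical helper and the stop-word list is deliberately tiny:

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "a", "and", "was", "were", "to", "of", "we", "it", "in", "but"}


def top_words(texts, n=3):
    """Return the n most frequent non-stop-words across the given texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(n)


# Made-up review snippets.
sample_reviews = [
    "the food was great and the rides were great",
    "long queues but great food",
    "we loved the rides",
]
top = top_words(sample_reviews)
print(top)
```

Running the same count separately per branch or per sentiment label would show which themes drive positive and negative reviews, which the single combined word cloud cannot.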