Disneyland

About Disneyland

Disneyland Park is a land of enchantment where children and the young at heart find that dreams really do come true. It is a seamless blend of yesterday, today and tomorrow: smiling train conductors, marching bands and the clip-clop of horse-drawn carriages bring you back to carefree days, while you can also rocket through the galaxy, board a runaway train or trek through the jungle. Laughter is always in the air, with friendly smiles all around. The secret to Disneyland is its ability to change yet remain the same. As Walt Disney said, “Disneyland will never be completed. It will continue to grow as long as there is imagination left in the world.”

Goal:

We aim to derive insightful information about Disneyland from this dataset, analyze the reviews to draw meaningful conclusions, and conduct sentiment analysis on the visitor feedback.

About the Dataset

The dataset (DisneylandReviews.csv) contains 42,656 visitor reviews covering three Disneyland branches: California, Paris and Hong Kong. Each row is one review.

Column Description:

  • Review_ID: unique identifier for each review
  • Rating: visitor rating on a 1–5 scale
  • Year_Month: the year and month associated with the review ('missing' where not recorded)
  • Reviewer_Location: the reviewer's home country
  • Review_Text: the full text of the review
  • Branch: the Disneyland branch being reviewed (Disneyland_California, Disneyland_Paris or Disneyland_HongKong)

Importing Dataset

# libraries used
import numpy as np 
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
data = "/Users/richarddiaz/Desktop/DisneylandReviews.csv"
df=pd.read_csv(data,encoding="cp1252")
df
Review_ID Rating Year_Month Reviewer_Location Review_Text Branch
0 670772142 4 2019-4 Australia If you’ve ever been to Disneyland anywhere you… Disneyland_HongKong
1 670682799 4 2019-5 Philippines Its been a while since d last time we visit HK… Disneyland_HongKong
2 670623270 4 2019-4 United Arab Emirates Thanks God it wasn t too hot or too humid wh… Disneyland_HongKong
3 670607911 4 2019-4 Australia HK Disneyland is a great compact park. Unfortu… Disneyland_HongKong
4 670607296 4 2019-4 United Kingdom the location is not in the city, took around 1… Disneyland_HongKong
42651 1765031 5 missing United Kingdom i went to disneyland paris in july 03 and thou… Disneyland_Paris
42652 1659553 5 missing Canada 2 adults and 1 child of 11 visited Disneyland … Disneyland_Paris
42653 1645894 5 missing South Africa My eleven year old daughter and myself went to… Disneyland_Paris
42654 1618637 4 missing United States This hotel, part of the Disneyland Paris compl… Disneyland_Paris
42655 1536786 4 missing United Kingdom I went to the Disneyparis resort, in 1996, wit… Disneyland_Paris

42656 rows × 6 columns

## inspecting rows where Year_Month is recorded as 'missing'
df=df.loc[df['Year_Month']=='missing']
df
Review_ID Rating Year_Month Reviewer_Location Review_Text Branch
269 647038712 4 missing Philippines The first thing on our agenda when we finished… Disneyland_HongKong
282 646466731 3 missing Singapore Brought mum for the first time to Disneyland w… Disneyland_HongKong
622 620580249 3 missing Canada I have been to Tokyo and LA Disneyland!I also … Disneyland_HongKong
5347 318799221 4 missing Australia We pre bought tickets at the hotel (same price… Disneyland_HongKong
5799 284745152 5 missing Philippines Disneyland is indeed the most magical place in… Disneyland_HongKong
42651 1765031 5 missing United Kingdom i went to disneyland paris in july 03 and thou… Disneyland_Paris
42652 1659553 5 missing Canada 2 adults and 1 child of 11 visited Disneyland … Disneyland_Paris
42653 1645894 5 missing South Africa My eleven year old daughter and myself went to… Disneyland_Paris
42654 1618637 4 missing United States This hotel, part of the Disneyland Paris compl… Disneyland_Paris
42655 1536786 4 missing United Kingdom I went to the Disneyparis resort, in 1996, wit… Disneyland_Paris

2613 rows × 6 columns

# re-read the CSV, treating the 'missing' placeholder (seen in the dataframe above) as NaN
df=pd.read_csv(data,encoding="cp1252",na_values=['missing'])

## dropping null values and confirming none remain
df=df.dropna().reset_index()
print ("\nMissing values :  ", df.isnull().sum().values.sum())
Missing values :   0
## dropping any duplicate entries
df.drop_duplicates(subset='Review_Text', inplace=True, keep='first')
## data summary 
print ("Rows     : " ,df.shape[0])
print ("Columns  : " ,df.shape[1])
print ("\nFeatures : \n" ,df.columns.tolist())
print ("\nMissing values :  ", df.isnull().sum().values.sum())
print ("\nUnique values :  \n",df.nunique())
Rows     :  40022
Columns  :  7

Features : 
 ['index', 'Review_ID', 'Rating', 'Year_Month', 'Reviewer_Location', 'Review_Text', 'Branch']

Missing values :   0

Unique values :  
 index                40022
Review_ID            40014
Rating                   5
Year_Month             111
Reviewer_Location      162
Review_Text          40022
Branch                   3
dtype: int64

After cleaning, the dataset consists of 40,022 rows and 7 columns (the extra index column comes from reset_index) with no missing values. Every review text is unique, the reviews come from 162 reviewer locations, are rated on a 5-point scale (Rating), span 111 distinct Year_Month periods, and cover three branches. Note that Review_ID has only 40,014 unique values, so a few IDs appear more than once with different review text.
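As a quick sanity check on those repeated IDs (a sketch, not part of the original output), pandas' duplicated can list them:

# Sketch: show reviews whose Review_ID appears more than once (keep=False marks every occurrence)
dupe_ids = df[df.duplicated(subset='Review_ID', keep=False)]
print(dupe_ids[['Review_ID', 'Branch', 'Rating']].sort_values('Review_ID').head())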

# reviews per branch
df['Branch'].value_counts()
Disneyland_California    18196
Disneyland_Paris         12691
Disneyland_HongKong       9135
Name: Branch, dtype: int64

The data shows the number of reviews for each Disneyland branch: 18,196 for Disneyland California, 12,691 for Disneyland Paris, and 9,135 for Disneyland Hong Kong.
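Since plotly.express was imported earlier but not yet used, an interactive version of this breakdown could look like the sketch below (same df as above; the styling choices are ours):

# Sketch: interactive bar chart of reviews per branch with plotly express
branch_counts = df['Branch'].value_counts()
fig = px.bar(x=branch_counts.index, y=branch_counts.values,
             labels={'x': 'Branch', 'y': 'Number of reviews'},
             title='Reviews per Disneyland Branch')
fig.show()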

Exploration - Text Length

Next, we split the Year_Month column into separate year and month columns so that we can derive a quarter feature. This mimics quarterly reporting of park sentiment: quarterly results typically serve as a benchmark for how well the parks are doing and help highlight reviews from specific periods.

# new data frame with split value columns 
new = df["Year_Month"].str.split("-", n = 1, expand = True) 
  
# making separate year column from new data frame 
df["year"]= new[0] 
  
# making separate month column from new data frame 
df["month"]= new[1] 
  
# Dropping old feature  
df.drop(columns =["Year_Month"], inplace = True) 

# Keep month as an integer value 
df['month']=df['month'].astype('int64')
# Quarterly extract - the lambda maps each month to its quarter using chained if/else expressions
df['quarter']=df['month'].apply(lambda x:1 if x<=3 else (2 if 3<x<=6 else (3 if 6<x<=9 else (4 if 9<x<=12 else x))))
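As an aside, pandas can derive the same fields from a parsed date; a minimal sketch, assuming Year_Month is still available in its original 'YYYY-M' form (df_raw here is a hypothetical copy kept before the column was dropped):

# Sketch: year/month/quarter features via pandas datetime parsing
ym = pd.to_datetime(df_raw['Year_Month'], format='%Y-%m', errors='coerce')  # df_raw: hypothetical pre-drop copy
df_raw['year'] = ym.dt.year
df_raw['month'] = ym.dt.month
df_raw['quarter'] = ym.dt.quarter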
## sample review text from the array 
df['Review_Text'].values[2:3]
array(['Thanks God it wasn   t too hot or too humid when I was visiting the park   otherwise it would be a big issue (there is not a lot of shade).I have arrived around 10:30am and left at 6pm. Unfortunately I didn   t last until evening parade, but 8.5 hours was too much for me.There is plenty to do and everyone will find something interesting for themselves to enjoy.It wasn   t extremely busy and the longest time I had to queue for certain attractions was 45 minutes (which is really not that bad).Although I had an amazing time, I felt a bit underwhelmed with choice of rides and attractions. The park itself is quite small (I was really expecting something grand   even the main castle which was closed by the way was quite small).The food options are good, few coffee shops (including Starbucks) and plenty of gift shops. There was no issue with toilets as they are everywhere.All together it was a great day out and I really enjoyed it.'],
      dtype=object)
# review length (character count) of each review
df['review length']=df['Review_Text'].apply(lambda x:len(x))
#pip install wordcloud
Successfully installed wordcloud-1.9.3
Note: you may need to restart the kernel to use updated packages.
## recapping libraries used: 
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS , ImageColorGenerator
import re
from PIL import Image
# raw review texts as a numpy array
features = df['Review_Text'].values

# container for the cleaned review texts
processed_features = []

for sentence in range(0, len(features)):
    # Remove all http(s):// URLs
    processed_feature = re.sub(r'(https?://\S+)', '', str(features[sentence]))
    
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', processed_feature)

    # Remove all single characters
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove single characters from the start of the text
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature) 

    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Converting to Lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)
# creating new dataframe for wordcloud 
df3=pd.DataFrame()
df3['reviews']=processed_features
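The same cleaning steps could also be packaged as a small helper and applied column-wise; a minimal sketch (the clean_review name is ours, not from the original notebook):

# Sketch: the cleaning loop above wrapped in a reusable helper
def clean_review(text):
    text = re.sub(r'https?://\S+', '', str(text))   # drop URLs
    text = re.sub(r'\W', ' ', text)                 # drop special characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)     # drop single characters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)       # drop a leading single character
    text = re.sub(r'\s+', ' ', text)                # collapse repeated whitespace
    return text.lower().strip()

# equivalent usage: df3['reviews'] = df['Review_Text'].apply(clean_review)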
#pip install textblob
Successfully installed nltk-3.8.1 textblob-0.18.0.post0
Note: you may need to restart the kernel to use updated packages.
from textblob import TextBlob
from wordcloud import WordCloud

Creating Functions for Subjectivity and Polarity

# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

# Create two new columns 'Subjectivity' & 'Polarity'
df3['Subjectivity'] = df3['reviews'].apply(getSubjectivity)
df3['Polarity'] = df3['reviews'].apply(getPolarity)
# Score determination for each review
def getAnalysis(score):
    if score < 0:        # score below zero -> negative
        return 'Negative'
    elif score == 0:     # score equal to zero -> neutral
        return 'Neutral'
    else:                # otherwise -> positive
        return 'Positive'

# capture the label in a new column
df3['Analysis'] = df3['Polarity'].apply(getAnalysis)
df3
reviews Subjectivity Polarity Analysis
0 if you ve ever been to disneyland anywhere you… 0.561481 0.239352 Positive
1 its been while since last time we visit hk dis… 0.459783 0.205797 Positive
2 thanks god it wasn too hot or too humid when w… 0.434857 0.119238 Positive
3 hk disneyland is great compact park unfortunat… 0.512143 0.189286 Positive
4 the location is not in the city took around 1 … 0.437500 0.266667 Positive
40017 although our pick up was prompt the taxi drive… 0.470556 0.034402 Positive
40018 just returned from 4 days family trip to disne… 0.437991 0.202937 Positive
40019 we spent the 20 dec 2010 in the disney park an… 0.493521 0.020628 Positive
40020 well was really looking forward to this trip o… 0.497893 0.125890 Positive
40021 if staying at disney hotel make good use of yo… 0.445000 0.115000 Positive

40022 rows × 4 columns
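For intuition, TextBlob's sentiment on a single sentence looks like this (a quick illustrative check; the sentence is our own, not from the dataset):

# Sketch: TextBlob sentiment on a sample sentence
sample = "The parade was wonderful but the queues were far too long."
print(TextBlob(sample).sentiment)  # Sentiment(polarity=..., subjectivity=...)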

Sentiment Reviews

# sentiment analysis, ratings, and reviews 
df['sentiment']=df3['Analysis'] # TextBlob-based analysis label
df['Sentiment']=df['Rating'].apply(lambda x:'Negative' if x<3 else np.nan) # ratings below 3 are forced to Negative
df['Sentiment']=df['Sentiment'].fillna(df['sentiment']) # otherwise fall back to the TextBlob label
df['Reviews_Text']=df3['reviews']
# drop the helper sentiment column and Review_ID since they are not needed for now
df=df.drop(['sentiment','Review_ID'],axis=1)
df_copy=df.copy()
df.head()
index Rating Reviewer_Location Review_Text Branch year month quarter review length Sentiment Reviews_Text
0 0 4 Australia If you’ve ever been to Disneyland anywhere you… Disneyland_HongKong 2019 4 2 329 Positive if you ve ever been to disneyland anywhere you…
1 1 4 Philippines Its been a while since d last time we visit HK… Disneyland_HongKong 2019 5 2 970 Positive its been while since last time we visit hk dis…
2 2 4 United Arab Emirates Thanks God it wasn t too hot or too humid wh… Disneyland_HongKong 2019 4 2 938 Positive thanks god it wasn too hot or too humid when w…
3 3 4 Australia HK Disneyland is a great compact park. Unfortu… Disneyland_HongKong 2019 4 2 485 Positive hk disneyland is great compact park unfortunat…
4 4 4 United Kingdom the location is not in the city, took around 1… Disneyland_HongKong 2019 4 2 163 Positive the location is not in the city took around 1 …

Changes from previous dataframe:

  • Temporal Breakdown: The date information has been expanded from Year_Month into separate year, month and quarter columns, allowing more granular temporal analysis.
  • Review Analysis Enhancements: A new review length column quantifies the length of each review, which can be useful for correlating review length with sentiment or ratings (see the quick check after this list).
  • Sentiment Analysis: A Sentiment column provides a pre-analyzed label (e.g., Positive) for each review, aiding quick sentiment trend analysis.
  • Text Normalization: The Reviews_Text column is a normalized, simplified version of Review_Text, cleaned for consistency and to make text analysis easier.
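A quick check of that length/rating relationship could look like the following sketch (illustrative only; its output is not reproduced here):

# Sketch: does review length move with rating or sentiment?
print(df[['review length', 'Rating']].corr())
print(df.groupby('Sentiment')['review length'].mean())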

Data Analysis

Reviews - Rolling Years

import seaborn as sns
import matplotlib.pyplot as plt

# Set the aesthetics for the plot
sns.set_style("whitegrid")
sns.set_context("talk")  # Larger font size

# Group the data and sum the review lengths by year
# (note: this reuses the df3 name, overwriting the earlier sentiment dataframe)
df3 = df.groupby('year', as_index=False).agg({'review length': 'sum'})

# Create the plot
plt.figure(figsize=(14, 7))
plt.plot(df3['year'], df3['review length'], marker='o', linestyle='-', label='Review Length by Year', color='blue')

# Add title and labels
plt.title('Review Length Summarized by Year', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Total Review Length', fontsize=16)

# Show grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Add legend
plt.legend(title='Legend', title_fontsize='13', fontsize='12', loc='upper left')

# Show the plot
plt.show()
[plot: Review Length Summarized by Year]

We can see total review length increases up to 2015 and decreases afterward.

Branch Comparisons Per Year

# Set the figure size
plt.figure(figsize=(10, 6))

# Create a countplot
sns.countplot(data=df, x='year', hue='Branch', palette='Blues')

# Calculate the moving average
window_size = 3  # Define the window size for the moving average

# Calculate the moving average for each branch
for branch in df['Branch'].unique():
    branch_data = df[df['Branch'] == branch]
    yearly_counts = branch_data.groupby('year').size().rolling(window=window_size).mean()
    plt.plot(yearly_counts.index, yearly_counts, label=f'{branch} MA', marker='o')

# Add labels
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Yearly Count by Branch with Moving Average')

# Add legend
plt.legend(fontsize=8)  # adjust here legend size

# Show the plot
plt.show()
[plot: Yearly Count by Branch with Moving Average]

Review Length - Quarterly Review

# Create a figure with two subplots side-by-side and set a larger overall figure size for better readability
fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=False)

# Plotting the first subplot: Quarter by Ratings
sns.countplot(ax=axes[0], data=df, x='quarter', hue='Rating', palette='Set2')
axes[0].set_title('Quarter By Ratings', fontsize=16)
axes[0].set_xlabel('Quarter', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].grid(True, which='both', linestyle='--', linewidth=0.5)
axes[0].legend(title='Rating', title_fontsize=12, fontsize=10)

# Plotting the second subplot: Quarter per reviews
sns.countplot(ax=axes[1], data=df, x='quarter', palette='coolwarm')
axes[1].set_title('Quarter per Reviews', fontsize=16)
axes[1].set_xlabel('Quarter', fontsize=14)
axes[1].set_ylabel('Count', fontsize=14)
axes[1].grid(True, which='both', linestyle='--', linewidth=0.5)

# Adjust the layout to prevent overlap and ensure clarity
plt.tight_layout()

# Show the plot
plt.show()
[plot: Quarter By Ratings and Quarter per Reviews]

From a quarterly perspective, review counts are lowest in the first quarter.

Countries: Lowest vs Highest Average Ratings

df4 = df.groupby(['Reviewer_Location'],as_index=False).agg({'Rating':'mean'}).sort_values(by='Rating', ascending=True).head(10)
df4.style.background_gradient(cmap="autumn", subset=['Rating'])
  Reviewer_Location Rating
3 Andorra 2.000000
147 Turks and Caicos Islands 2.000000
132 South Sudan 2.000000
136 Suriname 2.000000
39 Ecuador 2.333333
58 Haiti 3.000000
129 Solomon Islands 3.000000
107 Northern Mariana Islands 3.000000
36 Democratic Republic of the Congo 3.000000
128 Slovenia 3.000000
df4 = df.groupby(['Reviewer_Location'], as_index=False).agg({'Rating': 'mean'}).sort_values(by='Rating', ascending=False).head(10)
df4.style.background_gradient(cmap="autumn", subset=['Rating'])
  Reviewer_Location Rating
81 Libya 5.000000
23 Caribbean Netherlands 5.000000
31 Cuba 5.000000
32 Curaçao 5.000000
43 Ethiopia 5.000000
44 Falkland Islands (Islas Malvinas) 5.000000
49 Georgia 5.000000
54 Grenada 5.000000
66 Iraq 5.000000
89 Mali 5.000000

Ratings Per Year



# Set the plot style
sns.set_style("darkgrid")

# Initialize the figure
plt.figure(figsize=(14, 7))

# Define colors for the plot lines
colors = plt.cm.viridis(np.linspace(0, 1, 5))

# Loop through the rating values to aggregate and plot data
for i, color in zip(range(1, 6), colors):
    subset = df.loc[df['Rating'] == i]
    grouped = subset.groupby('year', as_index=False).agg({'Rating': 'sum'})
    plt.plot(grouped['year'], grouped['Rating'], marker='o', label=str(i), color=color)

# Add legend, labels, and title
plt.legend(title='Rating')
plt.xlabel('Year')
plt.ylabel('Sum of Ratings')
plt.title('Sum of Ratings by Year')

# Show the plot
plt.show()
[plot: Sum of Ratings by Year]

Sentiment Branches

plt.figure(figsize=(10,6))
sns.countplot(data=df,x='Branch',hue='Sentiment',palette='inferno');
[plot: Sentiment counts by Branch]

Sentiment Percentages - Branch

# Define the branches to analyze
branches = ['Disneyland_California', 'Disneyland_Paris', 'Disneyland_HongKong']
colors = ['#228B22', '#CC0000', '#00BFFF']  # Define a common color scheme for the pie charts

# Initialize the subplot
fig, axes = plt.subplots(1, 3, figsize=(20, 7))
fig.suptitle('Branches Sentiment Distribution')

# Define text properties for pie chart labels to make them bold
textprops = {"weight": "bold"}  # making the text bold

# Loop through each branch and create the pie charts
for ax, branch in zip(axes, branches):
    # Filter the data frame by branch and get the sentiment counts
    sentiment_counts = df.loc[df['Branch'] == branch, 'Sentiment'].value_counts(sort=True)
    labels = sentiment_counts.index
    sizes = sentiment_counts.values

    # Create the pie chart for each branch
    ax.pie(sizes, labels=labels, startangle=90, shadow=True, autopct='%1.2f%%', colors=colors, textprops=textprops)
    ax.set_title(branch.split('_')[1])  # Set the title to the branch name

plt.show()

[plot: Branches Sentiment Distribution]

Sentiment Views

# Convert the 'Reviews_Text' column to a single string
df['Reviews_Text'] = df['Reviews_Text'].astype('str')
reviews_text = " ".join(txt for txt in df['Reviews_Text'])

# Create a WordCloud object without an image mask
wc = WordCloud(background_color='white', 
               mode='RGB', width=1000, max_words=1000, height=1000,
               random_state=1, contour_width=1, contour_color='black', colormap='flag')

# Generate the word cloud
wc.generate(reviews_text)

# Display the word cloud
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.tight_layout(pad=0)
plt.axis('off')  # Remove the axes
plt.show()
[plot: word cloud of the review text]

Learnings

There are many variables one could use to try to predict how visitors view the parks. From a holistic or macro perspective, it is helpful to have an overall picture of how the parks are doing. Nevertheless, there are some good reviews centered on food and the characters. The word cloud could have been more detailed, but it still gives a good overall sense of the important themes and ideas to start looking into (one way to sharpen it is sketched below).
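One way to make the word cloud more informative, as a hedged sketch: filter out obvious filler words before generating it (the extra stopwords listed here are our own guesses, and reviews_text is the string built in the word cloud cell above).

# Sketch: regenerate the word cloud with common filler words removed
from wordcloud import WordCloud, STOPWORDS

custom_stopwords = set(STOPWORDS) | {'disneyland', 'disney', 'park', 'day', 'one', 'go', 'went'}
wc = WordCloud(background_color='white', width=1000, height=1000,
               max_words=500, stopwords=custom_stopwords, random_state=1)
wc.generate(reviews_text)

plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()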