DisneyRev

Disneyland

About Disney? Disneyland Park is a land of enchantment where children and the young at heart find that dreams really do come true. It is a seamless blend of yesterday, today and tomorrow. Smiling train conductors, marching bands and the clip-clop of horse-drawn carriages bring you back to carefree days. You can also rocket through the galaxy, board a runaway train or trek through the jungle. Laughter is always in the air, with friendly smiles all around. The secret to Disneyland is its ability to change yet remain the same. As Walt Disney said, “Disneyland will never be completed. It will continue to grow as long as there is imagination left in the world.”

Goal:

We aim to derive insights about Disneyland from this dataset, analyze the reviews to draw meaningful conclusions, and run sentiment analysis on visitor feedback.

About the Dataset

The dataset includes 42,656 reviews of three Disneyland branches (Paris, California and Hong Kong), posted by visitors on TripAdvisor.

Column Description:

Review_ID: unique id given to each review
Rating: ranging from 1 (unsatisfied) to 5 (satisfied)
Year_Month: when the reviewer visited the theme park
Reviewer_Location: country of origin of visitor
Review_Text: comments made by visitor
Branch: location of Disneyland Park (California, Hong Kong, Paris)

Importing Dataset

# library used
import numpy as np 
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

data = "/Users/richarddiaz/Desktop/DisneylandReviews.csv"
df=pd.read_csv(data,encoding="cp1252")
df

	Review_ID	Rating	Year_Month	Reviewer_Location	Review_Text	Branch
0	670772142	4	2019-4	Australia	If you’ve ever been to Disneyland anywhere you…	Disneyland_HongKong
1	670682799	4	2019-5	Philippines	Its been a while since d last time we visit HK…	Disneyland_HongKong
2	670623270	4	2019-4	United Arab Emirates	Thanks God it wasn t too hot or too humid wh…	Disneyland_HongKong
3	670607911	4	2019-4	Australia	HK Disneyland is a great compact park. Unfortu…	Disneyland_HongKong
4	670607296	4	2019-4	United Kingdom	the location is not in the city, took around 1…	Disneyland_HongKong
…	…	…	…	…	…	…
42651	1765031	5	missing	United Kingdom	i went to disneyland paris in july 03 and thou…	Disneyland_Paris
42652	1659553	5	missing	Canada	2 adults and 1 child of 11 visited Disneyland …	Disneyland_Paris
42653	1645894	5	missing	South Africa	My eleven year old daughter and myself went to…	Disneyland_Paris
42654	1618637	4	missing	United States	This hotel, part of the Disneyland Paris compl…	Disneyland_Paris
42655	1536786	4	missing	United Kingdom	I went to the Disneyparis resort, in 1996, wit…	Disneyland_Paris

42656 rows × 6 columns

## inspecting rows whose visit date is recorded as 'missing'
df.loc[df['Year_Month']=='missing']

	Review_ID	Rating	Year_Month	Reviewer_Location	Review_Text	Branch
269	647038712	4	missing	Philippines	The first thing on our agenda when we finished…	Disneyland_HongKong
282	646466731	3	missing	Singapore	Brought mum for the first time to Disneyland w…	Disneyland_HongKong
622	620580249	3	missing	Canada	I have been to Tokyo and LA Disneyland!I also …	Disneyland_HongKong
5347	318799221	4	missing	Australia	We pre bought tickets at the hotel (same price…	Disneyland_HongKong
5799	284745152	5	missing	Philippines	Disneyland is indeed the most magical place in…	Disneyland_HongKong
…	…	…	…	…	…	…
42651	1765031	5	missing	United Kingdom	i went to disneyland paris in july 03 and thou…	Disneyland_Paris
42652	1659553	5	missing	Canada	2 adults and 1 child of 11 visited Disneyland …	Disneyland_Paris
42653	1645894	5	missing	South Africa	My eleven year old daughter and myself went to…	Disneyland_Paris
42654	1618637	4	missing	United States	This hotel, part of the Disneyland Paris compl…	Disneyland_Paris
42655	1536786	4	missing	United Kingdom	I went to the Disneyparis resort, in 1996, wit…	Disneyland_Paris

2613 rows × 6 columns

# re-read the file, treating the 'missing' marker seen above as NaN
df=pd.read_csv(data,encoding="cp1252",na_values=['missing'])

## dropping rows with null values (reset_index keeps the old index as an 'index' column), then confirming none remain
df=df.dropna().reset_index()
print ("\nMissing values :  ", df.isnull().sum().values.sum())

Missing values :   0

## dropping any duplicate entries
df.drop_duplicates(subset='Review_Text', inplace=True, keep='first')

## data summary 
print ("Rows     : " ,df.shape[0])
print ("Columns  : " ,df.shape[1])
print ("\nFeatures : \n" ,df.columns.tolist())
print ("\nMissing values :  ", df.isnull().sum().values.sum())
print ("\nUnique values :  \n",df.nunique())

Rows     :  40022
Columns  :  7

Features : 
 ['index', 'Review_ID', 'Rating', 'Year_Month', 'Reviewer_Location', 'Review_Text', 'Branch']

Missing values :   0

Unique values :  
 index                40022
Review_ID            40014
Rating                   5
Year_Month             111
Reviewer_Location      162
Review_Text          40022
Branch                   3
dtype: int64

The dataset consists of 40,022 rows and 7 columns with no missing values, covering Disneyland reviews across three branches, rated on a 5-point scale (Rating). Every Review_Text is unique, while Review_ID has 40,014 distinct values, so a few IDs appear more than once. The reviews come from 162 reviewer locations and span 111 distinct Year_Month values.

#review per branch
df['Branch'].value_counts()

Disneyland_California    18196
Disneyland_Paris         12691
Disneyland_HongKong       9135
Name: Branch, dtype: int64

The data shows the number of reviews for each Disneyland branch: 18,196 for Disneyland California, 12,691 for Disneyland Paris, and 9,135 for Disneyland Hong Kong.
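As a quick worked check, the counts above can be turned into shares of the 40,022 reviews. This is a minimal sketch that hard-codes the three counts from the output; the variable names are illustrative and not part of the original notebook.

```python
# Review counts copied from the value_counts() output above.
branch_counts = {
    "Disneyland_California": 18196,
    "Disneyland_Paris": 12691,
    "Disneyland_HongKong": 9135,
}

# Total number of reviews across the three branches.
total = sum(branch_counts.values())

# Percentage share of reviews for each branch.
shares = {branch: 100 * count / total for branch, count in branch_counts.items()}

for branch, share in shares.items():
    print(f"{branch}: {share:.1f}%")
```

California accounts for roughly 45% of the reviews, Paris for about 32% and Hong Kong for about 23%, so branch comparisons based on raw counts should keep this imbalance in mind.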

Exploration - Text Length

Next, we split the Year_Month column into separate year and month columns and derive a quarter feature from the month. Grouping reviews by quarter mimics quarterly reporting, which gives a benchmark for how the parks are doing over time and helps highlight specific reviews.

# new data frame with split value columns 
new = df["Year_Month"].str.split("-", n = 1, expand = True) 
  
# making separate year column from new data frame 
df["year"]= new[0] 
  
# making separate month column from new data frame 
df["month"]= new[1] 
  
# Dropping old feature  
df.drop(columns =["Year_Month"], inplace = True) 

# Keep month as integer value 
df['month']=df['month'].astype('int64')

# Quarter extraction: the lambda maps months 1-3 to quarter 1, 4-6 to 2, 7-9 to 3 and 10-12 to 4
df['quarter']=df['month'].apply(lambda x:1 if x<=3 else (2 if 3<x<=6 else (3 if 6<x<=9 else (4 if 9<x<=12 else x))))
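The month-to-quarter mapping can also be written with integer arithmetic. The sketch below is illustrative only: `split_year_month` and `month_to_quarter` are hypothetical helper names, and the year is converted to an integer here even though the notebook keeps it as a string. It also checks that the arithmetic form agrees with the nested conditional used above for every valid month.

```python
def split_year_month(value):
    """Split a 'YYYY-M' string such as '2019-4' into (year, month) integers."""
    year_text, month_text = value.split("-", 1)
    return int(year_text), int(month_text)


def month_to_quarter(month):
    """Map a calendar month (1-12) to its quarter (1-4)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return (month - 1) // 3 + 1


def nested_quarter(x):
    """The nested conditional from the notebook, kept for comparison."""
    return 1 if x <= 3 else (2 if 3 < x <= 6 else (3 if 6 < x <= 9 else (4 if 9 < x <= 12 else x)))


# Both forms agree for every valid month.
assert all(month_to_quarter(m) == nested_quarter(m) for m in range(1, 13))

year, month = split_year_month("2019-4")
print(year, month, month_to_quarter(month))  # prints: 2019 4 2
```

On a pandas integer column the same arithmetic would be `(df['month'] - 1) // 3 + 1`, which avoids calling a Python function once per row.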

## peek at the raw text of a single review
df['Review_Text'].values[2:3]

array(['Thanks God it wasn   t too hot or too humid when I was visiting the park   otherwise it would be a big issue (there is not a lot of shade).I have arrived around 10:30am and left at 6pm. Unfortunately I didn   t last until evening parade, but 8.5 hours was too much for me.There is plenty to do and everyone will find something interesting for themselves to enjoy.It wasn   t extremely busy and the longest time I had to queue for certain attractions was 45 minutes (which is really not that bad).Although I had an amazing time, I felt a bit underwhelmed with choice of rides and attractions. The park itself is quite small (I was really expecting something grand   even the main castle which was closed by the way was quite small).The food options are good, few coffee shops (including Starbucks) and plenty of gift shops. There was no issue with toilets as they are everywhere.All together it was a great day out and I really enjoyed it.'],
      dtype=object)

# character length of each review
df['review length']=df['Review_Text'].str.len()

#pip install wordcloud

## recapping library used: 
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS , ImageColorGenerator
import re
from PIL import Image

# raw review texts to clean
features = df['Review_Text'].values
processed_features = []

for feature in features:
    # Remove http/https URLs
    processed_feature = re.sub(r'https?://\S+', '', str(feature))

    # Replace every non-word character with a space
    processed_feature = re.sub(r'\W', ' ', processed_feature)

    # Remove single characters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single character at the start of the text
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Collapse runs of whitespace into a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a leading 'b' left over from byte-string representations
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

# creating new dataframe for wordcloud 
df3=pd.DataFrame()
df3['reviews']=processed_features
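To see what the cleaning steps do to a single review, here is a minimal sketch that wraps them in a function. `clean_review` is a hypothetical name, the sample sentence and URL are made up, and the sketch assumes the "single character at the start" step is meant to anchor at the beginning of the string. The leading-'b' step is left out for brevity.

```python
import re


def clean_review(text):
    """Apply the cleaning steps in order and return the normalized string."""
    text = re.sub(r"https?://\S+", "", str(text))  # drop URLs
    text = re.sub(r"\W", " ", text)                # non-word characters -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)    # isolated single letters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)      # single letter at the start
    text = re.sub(r"\s+", " ", text)               # collapse whitespace
    return text.lower()


sample = "Thanks God it wasn   t too hot! See http://example.com now"
print(clean_review(sample))  # prints: thanks god it wasn too hot see now
```

The stray "t" left behind by a lost apostrophe is removed as an isolated single letter, which is why "wasn   t" shows up as "wasn" in the cleaned reviews.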

#pip install textblob

from textblob import TextBlob
from wordcloud import WordCloud

Creating Function calls for: Subjectivity and Polarity

# Create a function to get the subjectivity
def getSubjectivity(text):
   return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
   return  TextBlob(text).sentiment.polarity

# Create two new columns 'Subjectivity' & 'Polarity'
df3['Subjectivity'] = df3['reviews'].apply(getSubjectivity)
df3['Polarity'] = df3['reviews'].apply(getPolarity)

# Label each review from its polarity score
def getAnalysis(score):
    if score < 0:  # below zero -> negative
        return 'Negative'
    elif score == 0:  # exactly zero -> neutral
        return 'Neutral'
    else:  # above zero -> positive
        return 'Positive'

# store the label for each review in a new column
df3['Analysis'] = df3['Polarity'].apply(getAnalysis)
df3

	reviews	Subjectivity	Polarity	Analysis
0	if you ve ever been to disneyland anywhere you…	0.561481	0.239352	Positive
1	its been while since last time we visit hk dis…	0.459783	0.205797	Positive
2	thanks god it wasn too hot or too humid when w…	0.434857	0.119238	Positive
3	hk disneyland is great compact park unfortunat…	0.512143	0.189286	Positive
4	the location is not in the city took around 1 …	0.437500	0.266667	Positive
…	…	…	…	…
40017	although our pick up was prompt the taxi drive…	0.470556	0.034402	Positive
40018	just returned from 4 days family trip to disne…	0.437991	0.202937	Positive
40019	we spent the 20 dec 2010 in the disney park an…	0.493521	0.020628	Positive
40020	well was really looking forward to this trip o…	0.497893	0.125890	Positive
40021	if staying at disney hotel make good use of yo…	0.445000	0.115000	Positive

40022 rows × 4 columns

Sentiment Reviews

# combine the TextBlob label with the star rating into a final sentiment
# (.values assigns by position: df3 has a fresh 0..n-1 index, while df's index has gaps after drop_duplicates)
df['sentiment']=df3['Analysis'].values  # TextBlob polarity label
df['Sentiment']=df['Rating'].apply(lambda x:'Negative' if x<3 else np.nan)  # ratings below 3 are always Negative
df['Sentiment']=df['Sentiment'].fillna(df['sentiment'])  # otherwise fall back to the TextBlob label

df['Reviews_Text']=df3['reviews'].values  # cleaned review text

# drop the helper sentiment column and Review_ID, which are not needed for now
df=df.drop(['sentiment','Review_ID'],axis=1)

df_copy=df.copy()
df.head()

	index	Rating	Reviewer_Location	Review_Text	Branch	year	month	quarter	review length	Sentiment	Reviews_Text
0	0	4	Australia	If you’ve ever been to Disneyland anywhere you…	Disneyland_HongKong	2019	4	2	329	Positive	if you ve ever been to disneyland anywhere you…
1	1	4	Philippines	Its been a while since d last time we visit HK…	Disneyland_HongKong	2019	5	2	970	Positive	its been while since last time we visit hk dis…
2	2	4	United Arab Emirates	Thanks God it wasn t too hot or too humid wh…	Disneyland_HongKong	2019	4	2	938	Positive	thanks god it wasn too hot or too humid when w…
3	3	4	Australia	HK Disneyland is a great compact park. Unfortu…	Disneyland_HongKong	2019	4	2	485	Positive	hk disneyland is great compact park unfortunat…
4	4	4	United Kingdom	the location is not in the city, took around 1…	Disneyland_HongKong	2019	4	2	163	Positive	the location is not in the city took around 1 …

Changes from previous dataframe:

Temporal Breakdown: Year_Month has been split into separate year, month, and quarter columns, allowing analysis by year, by month, or by quarter.
Review Analysis Enhancements: A new review length column holds the number of characters in each review, which can be useful for correlating review length with sentiment or ratings.
Sentiment Analysis: A new Sentiment column labels each review Positive, Neutral, or Negative. Reviews rated below 3 are labeled Negative; all others take the label derived from the TextBlob polarity score.
Text Normalization: The new Reviews_Text column holds a cleaned, lowercased version of Review_Text (URLs, punctuation, and stray single characters removed) to make text analysis more consistent.
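The rule behind the Sentiment column can be summarized in a few lines of plain Python. This is a sketch of the logic only, with hypothetical function names; it takes a polarity score as input rather than calling TextBlob, and the example scores are illustrative.

```python
def label_from_polarity(polarity):
    """Turn a polarity score into a label: <0 Negative, 0 Neutral, >0 Positive."""
    if polarity < 0:
        return "Negative"
    if polarity == 0:
        return "Neutral"
    return "Positive"


def final_sentiment(rating, polarity):
    """Ratings below 3 are always 'Negative'; otherwise use the polarity label."""
    if rating < 3:
        return "Negative"
    return label_from_polarity(polarity)


# (rating, polarity) pairs: the rating overrides a positive polarity when it is below 3.
examples = [(4, 0.239352), (2, 0.30), (3, 0.0), (5, -0.1)]
for rating, polarity in examples:
    print(rating, polarity, final_sentiment(rating, polarity))
```

One consequence of this rule is that a low star rating always wins over the text score, while a high star rating does not: a 5-star review with negative polarity is still labeled Negative.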

Data Analysis

Reviews - Rolling Years

import seaborn as sns
import matplotlib.pyplot as plt

# Set the aesthetics for the plot
sns.set_style("whitegrid")
sns.set_context("talk")  # Larger font size

# Group the data and sum the review lengths by year
df3 = df.groupby('year', as_index=False).agg({'review length': 'sum'})

# Create the plot
plt.figure(figsize=(14, 7))
plt.plot(df3['year'], df3['review length'], marker='o', linestyle='-', label='Review Length by Year', color='blue')

# Add title and labels
plt.title('Review Length Summarized by Year', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Total Review Length', fontsize=16)

# Show grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Add legend
plt.legend(title='Legend', title_fontsize='13', fontsize='12', loc='upper left')

# Show the plot
plt.show()

[Figure: Review Length Summarized by Year]

Total review length climbs to a peak around 2015 and declines afterward.

Branch Comparisons Per Year

# Set the figure size
plt.figure(figsize=(10, 6))

# Create a countplot
sns.countplot(data=df, x='year', hue='Branch', palette='Blues')

# Calculate the moving average
window_size = 3  # Define the window size for the moving average

# Calculate the moving average for each branch
for branch in df['Branch'].unique():
    branch_data = df[df['Branch'] == branch]
    yearly_counts = branch_data.groupby('year').size().rolling(window=window_size).mean()
    plt.plot(yearly_counts.index, yearly_counts, label=f'{branch} MA', marker='o')

# Add labels
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Yearly Count by Branch with Moving Average')

# Add legend
plt.legend(fontsize=8)  # adjust here legend size

# Show the plot
plt.show()

[Figure: Yearly Count by Branch with Moving Average]

Review Length - Quarterly Review

# Create a figure with two subplots side-by-side and set a larger overall figure size for better readability
fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=False)

# Plotting the first subplot: Quarter by Ratings
sns.countplot(ax=axes[0], data=df, x='quarter', hue='Rating', palette='Set2')
axes[0].set_title('Quarter By Ratings', fontsize=16)
axes[0].set_xlabel('Quarter', fontsize=14)
axes[0].set_ylabel('Count', fontsize=14)
axes[0].grid(True, which='both', linestyle='--', linewidth=0.5)
axes[0].legend(title='Rating', title_fontsize=12, fontsize=10)

# Plotting the second subplot: Quarter per reviews
sns.countplot(ax=axes[1], data=df, x='quarter', palette='coolwarm')
axes[1].set_title('Quarter per Reviews', fontsize=16)
axes[1].set_xlabel('Quarter', fontsize=14)
axes[1].set_ylabel('Count', fontsize=14)
axes[1].grid(True, which='both', linestyle='--', linewidth=0.5)

# Adjust the layout to prevent overlap and ensure clarity
plt.tight_layout()

# Show the plot
plt.show()

[Figure: Quarter By Ratings (left) and Quarter per Reviews (right)]

From a quarterly perspective, the first quarter has the fewest reviews.

Country: Lowest Ratings vs Highest Ratings

df4 = df.groupby(['Reviewer_Location'],as_index=False).agg({'Rating':'mean'}).sort_values(by='Rating', ascending=True).head(10)
df4.style.background_gradient(cmap="autumn", subset=['Rating'])

	Reviewer_Location	Rating
3	Andorra	2.000000
147	Turks and Caicos Islands	2.000000
132	South Sudan	2.000000
136	Suriname	2.000000
39	Ecuador	2.333333
58	Haiti	3.000000
129	Solomon Islands	3.000000
107	Northern Mariana Islands	3.000000
36	Democratic Republic of the Congo	3.000000
128	Slovenia	3.000000

df4 = df.groupby(['Reviewer_Location'], as_index=False).agg({'Rating': 'mean'}).sort_values(by='Rating', ascending=False).head(10)
df4.style.background_gradient(cmap="autumn", subset=['Rating'])

	Reviewer_Location	Rating
81	Libya	5.000000
23	Caribbean Netherlands	5.000000
31	Cuba	5.000000
32	Curaçao	5.000000
43	Ethiopia	5.000000
44	Falkland Islands (Islas Malvinas)	5.000000
49	Georgia	5.000000
54	Grenada	5.000000
66	Iraq	5.000000
89	Mali	5.000000
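Several of the averages above are round values such as 2.0 or 5.0, which may indicate locations with only a handful of reviews (an assumption, since the review counts are not shown). The sketch below illustrates one way to guard against that by requiring a minimum number of reviews per location. It uses plain Python and made-up sample rows; `mean_rating_by_location` and `min_reviews` are hypothetical names.

```python
from collections import defaultdict


def mean_rating_by_location(rows, min_reviews=1):
    """Average rating per location, keeping locations with at least min_reviews reviews."""
    totals = defaultdict(lambda: [0, 0])  # location -> [sum of ratings, review count]
    for location, rating in rows:
        totals[location][0] += rating
        totals[location][1] += 1
    return {
        location: rating_sum / count
        for location, (rating_sum, count) in totals.items()
        if count >= min_reviews
    }


# Made-up (location, rating) pairs for illustration only.
sample_rows = [
    ("Australia", 4), ("Australia", 5), ("Australia", 3),
    ("Andorra", 2),
    ("Libya", 5),
]

print(mean_rating_by_location(sample_rows))                 # every location
print(mean_rating_by_location(sample_rows, min_reviews=3))  # only well-covered locations
```

With pandas, the same idea would be a groupby on Reviewer_Location that aggregates both the mean and the count of Rating, followed by a filter on the count.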

Ratings Per Year

# Set the plot style
sns.set_style("darkgrid")

# Initialize the figure
plt.figure(figsize=(14, 7))

# Define colors for the plot lines
colors = plt.cm.viridis(np.linspace(0, 1, 5))

# Loop through the rating values to aggregate and plot data
for i, color in zip(range(1, 6), colors):
    subset = df.loc[df['Rating'] == i]
    grouped = subset.groupby('year', as_index=False).agg({'Rating': 'sum'})
    plt.plot(grouped['year'], grouped['Rating'], marker='o', label=str(i), color=color)

# Add legend, labels, and title
plt.legend(title='Rating')
plt.xlabel('Year')
plt.ylabel('Sum of Ratings')
plt.title('Sum of Ratings by Year')

# Show the plot
plt.show()

[Figure: Sum of Ratings by Year]

Sentiment Branches

plt.figure(figsize=(10,6))
sns.countplot(data=df,x='Branch',hue='Sentiment',palette='inferno');

[Figure: review counts by Branch and Sentiment]

Sentiment Percentages - Branch

# Define the branches to analyze
branches = ['Disneyland_California', 'Disneyland_Paris', 'Disneyland_HongKong']
colors = ['#228B22', '#CC0000', '#00BFFF']  # Define a common color scheme for the pie charts

# Initialize the subplot
fig, axes = plt.subplots(1, 3, figsize=(20, 7))
fig.suptitle('Branches Sentiment Distribution')

# Define text properties for pie chart labels to make them bold
textprops = {"weight": "bold"}  # making the text bold

# Loop through each branch and create the pie charts
for ax, branch in zip(axes, branches):
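    # Filter the data frame by branch and count sentiments in a fixed order,
    # so each sentiment keeps the same color in every pie chart
    sentiment_counts = df.loc[df['Branch'] == branch, 'Sentiment'].value_counts().reindex(['Positive', 'Negative', 'Neutral'], fill_value=0)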
    labels = sentiment_counts.index
    sizes = sentiment_counts.values

    # Create the pie chart for each branch
    ax.pie(sizes, labels=labels, startangle=90, shadow=True, autopct='%1.2f%%', colors=colors, textprops=textprops)
    ax.set_title(branch.split('_')[1])  # Set the title to the branch name

plt.show()

[Figure: Branches Sentiment Distribution]

Sentiment Views

# Convert the 'Reviews_Text' column to a single string
df['Reviews_Text'] = df['Reviews_Text'].astype('str')
reviews_text = " ".join(txt for txt in df['Reviews_Text'])

# Create a WordCloud object without an image mask
wc = WordCloud(background_color='white', 
               mode='RGB', width=1000, max_words=1000, height=1000,
               random_state=1, contour_width=1, contour_color='black', colormap='flag')

# Generate the word cloud
wc.generate(reviews_text)

# Display the word cloud
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation='bilinear')
plt.tight_layout(pad=0)
plt.axis('off')  # Remove the axes
plt.show()

[Figure: word cloud of the cleaned review text]

Learnings

Many variables could be used to try to predict reviews. From a holistic, macro perspective, it is helpful to have an overall picture of how the parks are doing. Nevertheless, there are some good reviews centered on food and characters. I could have gone into more detail with the word cloud, but it does give a good overall sense of the important themes and ideas to start looking into.