Marvel Movie Data

For my Python assignment, I used a Kaggle data set about Marvel movies, which can be found here: https://www.kaggle.com/datasets/joebeachcapital/marvel-movies. The data includes columns such as film, category, budget, audience score, critic score, gross, and year. With this information, I produced five visualizations to learn more about how different factors affect the success of a Marvel movie.

To confirm that the data was read into the data frame correctly, the first ten rows of the data frame are displayed below, along with the packages the code needs to import.

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import warnings
import matplotlib.patches as mpatches

warnings.filterwarnings("ignore")
# Load in Marvel Dataset
file = "MarvelMovies.csv"

# Read the file into a df
marvel_df = pd.read_csv(file)

# Drop the source column so that it prints properly in R Markdown
marvel_df = marvel_df.drop(['source'], axis=1)

# Display the df to make sure data accurately displays
marvel_df.head(10)


Visualization 1 - Scatter Plot

For the first visualization I made a scatter plot. Each film’s budget is plotted on the y-axis, and the size and color of each marker (a diamond) represent the worldwide gross that the movie made. I chose this plot to compare how well each movie did in terms of gross against the budget it was given to work with.

# Visualization 1

# Sort the df by year released
marvel_df = marvel_df.sort_values(by='year')

# Make my own color map to match Marvel Infinity stones for fun (blue, purple, green, red, yellow, orange)
infinitycolors = ['#266ef6', '#e429f2', '#12e772', '#ff0130', '#ffd300','#ff8b00']
infinitystone_cmap = ListedColormap(infinitycolors)

# Create Scatter Plot
plt.figure(figsize=(17,17))

plt.scatter(marvel_df['film'], marvel_df['budget'], marker='d', cmap=infinitystone_cmap,
           c=marvel_df['worldwide gross ($m)'], s=marvel_df['worldwide gross ($m)'], edgecolors='black')

plt.title('Marvel Movie Budget Scatter Plot')
plt.xlabel('Movie')
plt.ylabel('Budget (Millions)')
plt.xticks(rotation=70, ha='right')

plt.show()

From the scatter plot, you can see that movies with bigger budgets more consistently earn a higher gross, but Marvel also has a few outliers that earned a high gross on a lower budget.
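
To put a number on the budget versus gross relationship described above, one option is a correlation plus a gross-per-budget ratio. This is only a minimal sketch: the rows below are made-up values, not taken from the Kaggle file, and the names `sample_df`, `gross_per_budget`, and `best_return` are hypothetical helpers introduced for illustration.

```python
import pandas as pd

# Made-up example rows (NOT from the Kaggle file), in millions of dollars
sample_df = pd.DataFrame({
    'film': ['Film A', 'Film B', 'Film C', 'Film D'],
    'budget': [100, 200, 300, 150],
    'worldwide gross ($m)': [400, 600, 900, 1200],
})

# Pearson correlation between budget and gross
# (closer to 1 means a stronger positive linear relationship)
budget_gross_corr = sample_df['budget'].corr(sample_df['worldwide gross ($m)'])

# Gross earned per budget dollar; unusually high values flag low-budget outliers
sample_df['gross_per_budget'] = sample_df['worldwide gross ($m)'] / sample_df['budget']
best_return = sample_df.sort_values(by='gross_per_budget', ascending=False).iloc[0]

print(round(budget_gross_corr, 2))
print(best_return['film'], best_return['gross_per_budget'])
```

Run against the real `marvel_df`, the same two lines would show whether the outliers seen in the plot are strong enough to weaken the overall trend.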


Visualization 2 - Dual Axis Bar Chart

For my second visualization I made a bar chart comparing audience and critic scores for the top 10 movies by gross. I limited it to the top 10 so that the visualization would not be overcrowded.

# Visualization 2

# Sort the file by worldwide gross ($m) and create a data frame of the top 10 movies based on worldwide gross ($m)
top10Movies = marvel_df.sort_values(by='worldwide gross ($m)', ascending=False).head(10)

# Create the dual axis bar chart of the critics vs audience % score of the top 10 movies

# Convert the 'critics % score' and 'audience % score' columns from strings like '79%' to floats
top10Movies['critics % score'] = top10Movies['critics % score'].str.rstrip('%').astype('float')
top10Movies['audience % score'] = top10Movies['audience % score'].str.rstrip('%').astype('float')

# Create auto labels for the bar chart
def autolabel(these_bars, this_ax, place_of_decimals):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, format(height, place_of_decimals)
                     + '%', fontsize=11, color='black',ha='center',va='bottom')

# Create the bar chart
fig = plt.figure(figsize=(22,15))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()
bar_width = 0.42

x_pos = np.arange(len(top10Movies))
critic_bars = ax1.bar(x_pos-(0.5*bar_width), top10Movies['critics % score'], bar_width, color='#266ef6',
                      edgecolor='black', label='Critic Score')

audience_bars = ax2.bar(x_pos+(0.5*bar_width), top10Movies['audience % score'], bar_width, color='#e429f2',
                      edgecolor='black', label='Audience Score')

ax1.set_xlabel('Movie', fontsize=18)
ax1.set_ylabel('Critic Scores', fontsize=18, labelpad=20)
ax2.set_ylabel('Audience Scores', fontsize=18, rotation=270, labelpad=20)
ax1.tick_params(axis='y', labelsize = 14)
ax2.tick_params(axis='y', labelsize = 14)

plt.title('Critic and Audience Scores', fontsize=18)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(top10Movies['film'], fontsize=14, rotation=50, ha='right')

critic_color, critic_label = ax1.get_legend_handles_labels()
audience_color, audience_label = ax2.get_legend_handles_labels()
legend = ax1.legend(critic_color + audience_color, critic_label + audience_label, loc='upper right',
                    frameon=True, ncol=1, shadow=True, borderpad=1, fontsize=14)

autolabel(critic_bars, ax1,'.0f')
autolabel(audience_bars, ax2,'.0f')

plt.tight_layout()
plt.show()

In this chart you can see that audience and critic ratings mostly agree. The one major exception is ‘Captain Marvel’, which was definitely not an audience favorite. Looking back at the scatter plot in Visualization 1, you can see that it still made a decent amount of money for a movie with such a low rating. It was a highly anticipated movie, which I believe is why it grossed as much as it did.
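
The disagreement between critics and audiences can also be measured directly instead of read off the bars. The sketch below uses made-up scores rather than the real data, and `score_gap` and `biggest_gap` are hypothetical names introduced here.

```python
import pandas as pd

# Made-up example rows (NOT from the Kaggle file); scores stored as strings like '79%'
scores_df = pd.DataFrame({
    'film': ['Film A', 'Film B', 'Film C'],
    'critics % score': ['94%', '79%', '85%'],
    'audience % score': ['91%', '45%', '87%'],
})

# Same conversion as in the chart code: strip the '%' and cast to float
for col in ['critics % score', 'audience % score']:
    scores_df[col] = scores_df[col].str.rstrip('%').astype(float)

# Positive gap = critics rated the film higher than audiences did
scores_df['score_gap'] = scores_df['critics % score'] - scores_df['audience % score']

# Film with the largest absolute disagreement
biggest_gap = scores_df.loc[scores_df['score_gap'].abs().idxmax()]
print(biggest_gap['film'], biggest_gap['score_gap'])
```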

Visualization 3 - Nested Pie Chart

For my third visualization I made a nested pie chart. The outer ring shows the overall gross for the top 4 movies, while the inner ring breaks each movie’s gross into three sections by when it was earned: opening weekend (1st), second weekend (2nd), and the remainder earned at all other times (other).

# Visualization 3

# find the top 4 movies based on worldwide gross
top4movies = marvel_df.sort_values(by='worldwide gross ($m)', ascending=False).head(4)

top4movies = top4movies[['film','worldwide gross ($m)', 'opening weekend ($m)', 'second weekend ($m)']]

# create a column for the gross earned outside the first two weekends (overall gross minus both weekends)
top4movies['other'] = top4movies['worldwide gross ($m)'] - (top4movies['opening weekend ($m)']
                                                            + top4movies['second weekend ($m)'])

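# reshape top4movies from wide to long: one row per film and time period

piedf = pd.melt(top4movies, id_vars=['film'], value_vars=
                 ['opening weekend ($m)','second weekend ($m)','other'])

piedf.rename(columns={'variable': 'time', 'value': 'gross'}, inplace=True)

# melt stacks the rows one value column at a time (all opening weekends first),
# so map each column name to its label instead of assigning a repeating list
piedf['time'] = piedf['time'].map({'opening weekend ($m)': '1st',
                                   'second weekend ($m)': '2nd',
                                   'other': 'other'})

# sort by film, then time, so each film's slices sit together in 1st, 2nd, other order
piedf = piedf.sort_values(by=['film', 'time']).set_index('film')
piedf[['gross']]=piedf[['gross']].astype(int)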

totalgross = top4movies['worldwide gross ($m)'].sum()

#colors (blue, purple, red, yellow)
outside_colors = ['#266ef6', '#e429f2','#ff0130','#ffd300']
inner_colors = ['#C5D9FF', '#97BAFF','#6197FF',
               '#ECC2EF', '#E696ED','#E373EC',
               '#FFC6D0', '#FF93A7','#FF5D7B',
               '#FFF6C8', '#FFEE9B','#FFE569']


# create the pie chart
fig = plt.figure(figsize=(16,16))
ax = fig.add_subplot(1,1,1)

# create the outside donut
piedf.groupby(['film'])['gross'].sum().plot(kind = 'pie', radius=1.15, pctdistance=0.87, 
                                                          labeldistance=1.1, wedgeprops=dict(edgecolor='white'), 
                                                          textprops= {'fontsize':11}, autopct = lambda p: 
                                                          '{:.2f}%\n(${:.0f}M)'.format(p,(p/100)*
                                                                                       piedf['gross'].sum())
                                                          ,startangle=90,colors=outside_colors)

# create the inner donut
piedf.gross.plot(kind = 'pie', radius=0.89, colors = inner_colors, pctdistance=0.70,labeldistance=0.91, 
                 wedgeprops=dict(edgecolor='white'),
                 textprops= {'fontsize':10}, 
                 labels = piedf.time,
                 autopct = lambda p: '{:.0f}%\n(${:.0f}M)'.format(p,(p/100)*piedf['gross'].sum()),
                 startangle=90)

# create the white circle center
hole = plt.Circle((0,0), 0.45, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

# create the titles of plot
ax.yaxis.set_visible(False)
plt.title('Total Gross by Movie ($m)', fontsize=20)

ax.text(0, 0, 'Total Gross\n' + '$' + str(round(totalgross)) + 'M', size=10, ha='center', va='center')


ax.axis('equal')
plt.tight_layout()

plt.show()

The nested pie chart surprised me by showing just how close the grosses of ‘Avengers: Infinity War’ and ‘Spider-Man: No Way Home’ were. I knew how highly anticipated ‘Spider-Man: No Way Home’ was, since it features all three Spider-Men, but it still shocked me considering how big a part of the Marvel movies ‘Avengers: Infinity War’ was.
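
The inner ring only reads correctly if every row keeps the right time label after the wide-to-long reshape. The sketch below, on made-up values rather than the Kaggle data, shows the row order that `pd.melt` produces and one way to attach labels by mapping the column names. The names `wide_df`, `long_df`, and `label_map` are hypothetical.

```python
import pandas as pd

# Made-up example rows (NOT from the Kaggle file), in millions of dollars
wide_df = pd.DataFrame({
    'film': ['Film A', 'Film B'],
    'opening weekend ($m)': [100, 200],
    'second weekend ($m)': [50, 80],
    'other': [350, 520],
})

long_df = pd.melt(wide_df, id_vars=['film'],
                  value_vars=['opening weekend ($m)', 'second weekend ($m)', 'other'])
long_df = long_df.rename(columns={'variable': 'time', 'value': 'gross'})

# melt stacks one value column at a time, so the rows come out as
# A-opening, B-opening, A-second, B-second, A-other, B-other
melt_order = long_df['time'].tolist()
print(melt_order)

# Mapping column names to short labels keeps each label attached to the right row
label_map = {'opening weekend ($m)': '1st', 'second weekend ($m)': '2nd', 'other': 'other'}
long_df['time'] = long_df['time'].map(label_map)

# Sorting by film, then time, groups each film's three slices together
long_df = long_df.sort_values(by=['film', 'time']).reset_index(drop=True)
print(long_df)
```

Because the melted rows are grouped by time period rather than by film, a hand-typed repeating list of labels would not line up with them, which is why the mapping approach is the safer choice.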


Visualization 4 - Bump Chart

For my fourth visualization I made a bump chart. I again used the top 10 movies, as in the dual axis bar chart. For this chart, I ranked the movies by the amount of money each made in each time period, to get a different perspective on some of the information from the previous chart.

# Visualization 4

# Bump Chart

# create a df of the top 10 movies
top10movies = marvel_df.sort_values(by='worldwide gross ($m)', ascending=False).head(10)

top10movies = top10movies[['film','worldwide gross ($m)', 'opening weekend ($m)', 'second weekend ($m)']]

# create a data column to find the difference between the gross on the first two weekends and the overall
top10movies['other'] = top10movies['worldwide gross ($m)'] - (top10movies['opening weekend ($m)'] 
                                                                      + top10movies['second weekend ($m)'])

# create a data frame for bump
bumpdf = pd.melt(top10movies, id_vars=['film'], value_vars=
                 ['opening weekend ($m)','second weekend ($m)','other'])

bumpdf.rename(columns={'variable': 'time', 'value': 'gross'}, inplace=True)

bumpdf = bumpdf.sort_values(by='film')
bumpdf[['gross']]=bumpdf[['gross']].astype(int)

bumpdf = bumpdf.groupby(['film', 'time'])['gross'].sum().reset_index(name='TotalGross')

bumpdf = bumpdf.pivot(index='film', columns='time', values='TotalGross')

time_order = ['opening weekend ($m)','second weekend ($m)', 'other']

bumpdf = bumpdf.reindex(columns=time_order)

bumpdf_rank = bumpdf.rank(axis=0, ascending=False, method='min')

bumpdf_rank = bumpdf_rank.T

# create plot

fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(1,1,1)

bumpdf_rank.plot(kind='line',ax=ax, marker='o',markeredgewidth=1,linewidth=6,
                markersize=52, markerfacecolor='white')

ax.invert_yaxis()

num_rows = bumpdf_rank.shape[0]
num_cols = bumpdf_rank.shape[1]

plt.ylabel('Time Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Total Gross by Time and Film \n Bump Chart', fontsize = 18, pad = 15)

plt.yticks(range(1, num_cols+1, 1), fontsize = 14)

ax.set_xlabel('Time', fontsize = 10)

handles, labels = ax.get_legend_handles_labels()

handles = [handles[1], handles[8], handles[2], handles[9], handles[3],
          handles[0], handles[4], handles[6], handles[5], handles[7]]
labels = [labels[1], labels[8], labels[2], labels[9], labels[3],
         labels[0], labels[4], labels[6], labels[5], labels[7]]
ax.legend(handles, labels, bbox_to_anchor=(1.05,1.05),
         labelspacing=1,
         markerscale=.3,
         handletextpad=0.8)

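# label each marker with the gross behind its rank
# (bumpdf_rank is time x film after the transpose; bumpdf is film x time)
for j, film in enumerate(bumpdf_rank.columns):
    for i, period in enumerate(bumpdf_rank.index):
        this_rank = bumpdf_rank.iloc[i, j]
        ax.text(i, this_rank, '$' + str(round(bumpdf.iloc[j, i])) + 'M', ha='center', va='center', fontsize=12)
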
plt.show()

For this visualization I was very surprised that ‘Spider-Man: Far From Home’ ranked below ‘Captain Marvel’ at the start. Looking at the two in the dual axis bar chart, Spider-Man had much higher ratings, but I suppose again that ‘Captain Marvel’ was more highly anticipated than the second Spider-Man movie. I was also surprised that ‘Captain Marvel’ didn’t finish last overall. Instead, ‘Captain America: Civil War’ fell below it at the end, which could make sense because that movie was released in 2016, before the bigger Marvel movies ‘Avengers: Infinity War’ and ‘Avengers: Endgame’.
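
The rankings in the bump chart come from `DataFrame.rank`. As a small illustration of how `ascending=False` and `method='min'` behave, including on ties, here is a sketch on made-up values (not the Kaggle data); `gross_df` and `rank_df` are hypothetical names.

```python
import pandas as pd

# Made-up example values (NOT from the Kaggle file), in millions of dollars
gross_df = pd.DataFrame(
    {'opening weekend ($m)': [250, 180, 250],
     'second weekend ($m)': [110, 120, 90]},
    index=['Film A', 'Film B', 'Film C'],
)

# Rank down each column: the largest gross gets rank 1.
# method='min' gives tied films the same (lowest) rank and skips the next one,
# so the two films tied at 250 both get rank 1 and the third film gets rank 3.
rank_df = gross_df.rank(axis=0, ascending=False, method='min')
print(rank_df)
```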


Visualization 5 - Bar Chart

For my fifth visualization I made a regular bar chart. In this chart I looked at the percentage of the budget that was recovered for the top ten movies.

# Visualization 5

# Bar Chart

# Create a df for the bar chart

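# marvel_df is sorted by year at this point, so re-sort by worldwide gross to get the top 10 movies
barchartdf = marvel_df.sort_values(by='worldwide gross ($m)', ascending=False)[['film','% budget recovered']].head(10)
# strip the '%' and convert to int before sorting so the bars sort numerically, not alphabetically
barchartdf['% budget recovered']=barchartdf['% budget recovered'].str.replace('%','')
barchartdf[['% budget recovered']]=barchartdf[['% budget recovered']].astype(int)
barchartdf = barchartdf.sort_values(by='% budget recovered', ascending=True)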

budget_recovered_mean = barchartdf['% budget recovered'].mean()

barchart_colors = ['#12e772' if val > budget_recovered_mean else '#ff8b00' for val 
                   in barchartdf['% budget recovered']]


Above = mpatches.Patch(color='#12e772', label = 'Above Average')
Below = mpatches.Patch(color='#ff8b00', label = 'Below Average')

fig = plt.figure(figsize=(18,12))
ax1 = fig.add_subplot(1,1,1)
ax1.barh(barchartdf['film'], barchartdf['% budget recovered'], color = barchart_colors)

ax1.legend(loc='lower right', handles=[Above, Below], fontsize=14)
plt.axvline(barchartdf['% budget recovered'].mean(), color='black', linestyle='dashed')

ax1.set_title('% Budget Recovered', size=20)
ax1.set_xlabel('% of budget recovered', fontsize=16)
ax1.set_ylabel('film', fontsize = 16, rotation=45)

plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

For this chart the only surprise to me was that ‘Captain America: Civil War’ recovered an above-average percentage of its budget compared with the other top ten movies. Considering it ranked last in the bump chart, I didn’t think it would have made that much of its budget back, but looking at the first visualization it does make sense.
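
Because the ‘% budget recovered’ column is stored as text, the order of the conversion and the sort matters for a chart like this one. A minimal sketch on made-up values (not the Kaggle data) shows the difference, along with the above/below-average flag that drives the bar colors. The names `recovered_df`, `string_order`, `numeric_order`, and `above_average` are hypothetical.

```python
import pandas as pd

# Made-up example rows (NOT from the Kaggle file)
recovered_df = pd.DataFrame({
    'film': ['Film A', 'Film B', 'Film C'],
    '% budget recovered': ['250%', '1000%', '400%'],
})

# Sorting the raw strings is alphabetical, so '1000%' lands before '250%'
string_order = recovered_df.sort_values(by='% budget recovered')['film'].tolist()

# Strip the '%' and convert to int first to get a numeric sort
recovered_df['% budget recovered'] = (
    recovered_df['% budget recovered'].str.replace('%', '', regex=False).astype(int)
)
numeric_order = recovered_df.sort_values(by='% budget recovered')['film'].tolist()

# Flag each film against the group mean, as the bar colors do
mean_recovered = recovered_df['% budget recovered'].mean()
recovered_df['above_average'] = recovered_df['% budget recovered'] > mean_recovered

print(string_order)
print(numeric_order)
print(recovered_df)
```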