Introduction

This paper explores key factors influencing the financial success of movies, using a dataset of movie metadata. I analyze variables such as IMDb score, budget, gross revenue, director, and movie duration to uncover patterns in box office performance. By visualizing these relationships through various charts, the goal is to provide insights into how factors like movie length and director influence a film’s commercial success.

Through this analysis, I aim to reveal trends and correlations that can help guide decision-making in the film industry, offering a more data-driven understanding of what drives a movie’s box office revenue.

Dataset

The dataset “movie_metadata.csv” contains a variety of information about movies, including details on the cast, crew, genre, performance metrics, and social media engagement. It provides a wealth of data that could be used to explore relationships between different variables such as the movie’s budget, gross earnings, IMDb score, and social media popularity (Facebook likes for directors and actors). However, the dataset has several missing values across different columns. Key columns with missing data include director_name (104 missing), budget (492 missing), gross (884 missing), and content_rating (303 missing). These missing values may have a significant impact on analysis, particularly when working with performance metrics such as earnings and ratings.

To address the missing data, preprocessing steps were performed. For numerical columns like num_critic_for_reviews, duration, and gross, missing values were filled with the mean of the respective column, ensuring that no data points are lost for numerical analysis. For categorical columns like color, director_name, and language, missing values were replaced with the placeholder “NA” to maintain consistency across the dataset. Additionally, the title_year column, which had some missing values, was filled with a default value of 1900 and converted to integers. After these preprocessing steps, the dataset is more complete and ready for analysis, with missing values substantially reduced.
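
The preprocessing rules described above can be illustrated on a small example. This is a minimal sketch, not the analysis itself: the `toy` frame and its values are invented stand-ins for movie_metadata.csv, and only the three fill rules (column mean, the “NA” placeholder, and the 1900 default year) come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movie_metadata.csv (values are invented).
toy = pd.DataFrame({
    "gross": [100.0, np.nan, 300.0],
    "director_name": ["A", None, "C"],
    "title_year": [2001.0, np.nan, 1999.0],
})

# Count missing values per column before any filling.
missing_before = toy.isna().sum()

# Numerical column: fill with the column mean (here the mean of 100 and 300).
toy["gross"] = toy["gross"].fillna(toy["gross"].mean())
# Categorical column: fill with the "NA" placeholder.
toy["director_name"] = toy["director_name"].fillna("NA")
# Year column: fill with the default 1900, then convert to integers.
toy["title_year"] = toy["title_year"].fillna(1900).astype(int)

# Count missing values again after filling.
missing_after = toy.isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Comparing the two counts is a quick way to confirm that the filled columns no longer contain gaps. Note that filled rows carry the column mean or the placeholder, so they still contribute to any later sum or average.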

Findings

Binned Budget vs. Binned IMDb Score

The scatterplot visualizes the relationship between binned movie budgets and binned IMDb scores, with the size of each point representing the frequency of movies within each combination of budget and score bin. The largest points appear in the IMDb score ranges of 5-7 and 7-9, which are prevalent across all budget bins. This suggests that movies with mid-range IMDb scores are more common, regardless of budget size.

On the other hand, both extremely low (0-3) and high (9-10) IMDb scores are rare across all budget categories, as reflected by the smaller points in these score bins. The plot also shows that higher-budget films are not necessarily associated with top IMDb ratings, and lower-budget films are likewise concentrated in the mid-range score bins. Overall, the scatterplot highlights that movies with moderate budgets and moderate IMDb scores are the most common, while both low-rated and high-rated movies are comparatively rare.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.ticker import FuncFormatter

movie_data_path = r"C:\Users\wesma\OneDrive\Documents\DS-736\Python_datafiles\movie_metadata.csv"

df = pd.read_csv(movie_data_path)

num_columns = ['num_critic_for_reviews','duration','director_facebook_likes','actor_3_facebook_likes','actor_1_facebook_likes',
    'gross','num_user_for_reviews','budget','actor_2_facebook_likes','facenumber_in_poster']
for c in num_columns:
    df[c] = df[c].fillna(df[c].mean())

string_columns = ['color','director_name','actor_2_name','actor_1_name','actor_3_name','plot_keywords','language','country','content_rating']
for c in string_columns:
    df[c] = df[c].fillna("NA")

df.title_year = df.title_year.fillna(1900)
df.title_year = df.title_year.astype(int)

# Keep one row per title so duplicated entries are not counted twice
df = df.drop_duplicates(subset='movie_title')

# Bin IMDb scores and budgets into labeled ranges
df['score_bin'] = pd.cut(df.imdb_score, bins=[0, 3, 5, 7, 9, 10], labels=['0-3', '3-5', '5-7', '7-9', '9-10'])

df['budget_bin'] = pd.cut(df.budget, bins=[0, 1000000, 5000000, 10000000, 20000000,30000000,40000000,50000000, 100000000, float('inf')] , labels=['0-1M', '1M-5M', '5M-10M', '10M-20M','20M-30M','30M-40M','40M-50M', '50M-100M', '100M+'])
# Count movies in every budget/score combination (observed=False keeps empty combinations)
count_df = df.groupby(['budget_bin', 'score_bin'], observed=False).size().reset_index(name='count')
# Integer codes give each bin a position on the plot axes
count_df['budget_bin_code'] = count_df.budget_bin.cat.codes
count_df['score_bin_code'] = count_df.score_bin.cat.codes

plt.figure(figsize=(16,10))
plt.scatter(count_df.budget_bin_code, count_df.score_bin_code, marker='8', cmap='viridis', c=count_df['count'], s=count_df['count']*10, edgecolor='black')
plt.title('Binned Budget vs. Binned IMDB Score', fontsize=18)
plt.xlabel('Budget', fontsize=14)
plt.ylabel('IMDB Score', fontsize=14)

cbar=plt.colorbar()
cbar.set_label('Count', rotation=270, fontsize=14, color='black', labelpad=30)

my_colorbar_ticks = list(range(0, int(count_df['count'].max()), 50))
cbar.set_ticks(my_colorbar_ticks)
plt.xticks(ticks=range(len(df.budget_bin.cat.categories)),labels=df.budget_bin.cat.categories)
plt.yticks(ticks=range(len(df.score_bin.cat.categories)),labels=df.score_bin.cat.categories)
plt.show()
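
The point sizes in this chart come from counting how many movies fall into each budget/score combination. A minimal sketch of that counting step follows, using an invented four-row `toy` frame and coarser bins than the analysis (both are assumptions made for brevity).

```python
import pandas as pd

# Hypothetical toy data; the real analysis uses movie_metadata.csv.
toy = pd.DataFrame({
    "budget": [500_000, 2_000_000, 3_000_000, 15_000_000],
    "imdb_score": [6.1, 6.8, 7.4, 4.2],
})

# Bin each variable into labeled ranges (coarser than the bins in the analysis).
toy["budget_bin"] = pd.cut(
    toy["budget"],
    bins=[0, 1_000_000, 5_000_000, float("inf")],
    labels=["0-1M", "1M-5M", "5M+"],
)
toy["score_bin"] = pd.cut(
    toy["imdb_score"],
    bins=[0, 5, 7, 10],
    labels=["0-5", "5-7", "7-10"],
)

# Count rows per (budget_bin, score_bin) pair; observed=False keeps
# combinations with zero movies so every cell of the grid is present.
counts = (
    toy.groupby(["budget_bin", "score_bin"], observed=False)
    .size()
    .reset_index(name="count")
)
print(counts)
```

Each row of `counts` corresponds to one potential point in the scatterplot, and the `count` column drives both its size and its color.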

Budget and Gross Revenue of Top Grossing Films

The dual-axis bar chart visualizes the budgets and gross revenues of the top 10 highest-grossing films, with the budget represented by gray bars and the gross revenue by blue bars. Because the two series are drawn on separate y-axes, bar heights are comparable only within a series; the labels above each bar give the dollar values. As expected, the budgets of these films are substantial, with the largest belonging to Avengers: Age of Ultron and The Dark Knight Rises, at $250 million each. These high budgets are generally paired with even larger gross revenues, underscoring the strong financial returns of blockbuster films. For example, Avatar, with a budget of $237 million, earned $761 million in gross revenue.

However, the chart also shows that while higher budgets generally go with higher gross revenues, the relationship is not proportional. Jurassic World, with a budget of $150 million, earned $652 million, while Shrek 2, with a similar budget, earned $436 million. This highlights how some films with relatively moderate budgets still achieve exceptional box-office returns. In general, blockbuster films spend heavily on production, but their ability to generate revenue varies: some achieve returns well beyond their budget, while others fall short of the expected profit margins. Overall, substantial investments are often matched by significant earnings, but exceptions do exist.

top_grossing = df.sort_values('gross', ascending=False).head(10)
def autolabel(these_bars, this_axis, place_of_decimals, symbol):
    """Label each bar with its height in millions, e.g. $761M."""
    for each_bar in these_bars:
        height = each_bar.get_height()
        # place_of_decimals is a format spec such as '.0f'
        formatted_height = f"{symbol}{height / 1e6:{place_of_decimals}}M"
        this_axis.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, formatted_height,
                    fontsize=10, color='black', ha='center', va='bottom')
fig = plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1,1,1)
ax2 = ax1.twinx()
bar_width = 0.4

x_pos = np.arange(len(top_grossing))  # one position per film
budget_bars=ax1.bar(x_pos-(0.5*bar_width), top_grossing.budget, bar_width, color='gray', edgecolor='black', label='Budget')
gross_bars = ax2.bar(x_pos+(0.5*bar_width), top_grossing.gross, bar_width, color='blue', edgecolor='black', label='Gross')

ax1.set_xlabel('Movies', fontsize=18)
ax1.set_ylabel('Budget', fontsize=18, color='gray')
ax2.set_ylabel('Gross Revenue',fontsize=18, rotation=270, labelpad=20, color='blue')
plt.title('Budget and Gross Revenue of Top 10 Grossing Films', fontsize=18)

ax1.set_xticks(x_pos)
ax1.set_xticklabels(top_grossing.movie_title, rotation=45, ha='right')



budget_color, budget_label = ax1.get_legend_handles_labels()
gross_color, gross_label = ax2.get_legend_handles_labels()

legend=ax1.legend(budget_color + gross_color, budget_label + gross_label, loc='upper right', frameon=True, ncol=1, shadow=True,
                 borderpad=1, fontsize=14)

autolabel(budget_bars, ax1, '.0f', '$')
autolabel(gross_bars, ax2, '.0f', '$')
plt.show()
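
The point that gross revenue does not scale proportionally with budget can be made concrete by dividing gross by budget for the films cited above. This is a rough sketch using only the rounded figures quoted in the text. Shrek 2 is described only as having “a similar budget” to Jurassic World, so the 150 used for it here is an assumption, and `gross_multiple` is a hypothetical helper name.

```python
# Budget and gross figures, in millions of dollars, as quoted in the text.
films = {
    "Avatar": (237, 761),
    "Jurassic World": (150, 652),
    "Shrek 2": (150, 436),  # budget assumed equal to Jurassic World's
}


def gross_multiple(budget_m, gross_m):
    """Return gross revenue earned per dollar of budget."""
    return gross_m / budget_m


# Compute the multiple for each film, rounded to two decimals.
multiples = {
    title: round(gross_multiple(budget, gross), 2)
    for title, (budget, gross) in films.items()
}

for title, multiple in multiples.items():
    print(f"{title}: {multiple:.2f}x")
```

Under these figures, Jurassic World returns a higher multiple of its budget than Avatar, even though Avatar has the larger budget and the larger gross.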

Top 5 Directors: Average Gross Revenue Over Time

The multi-line chart shows the average gross revenue over time for the top 5 directors, defined here as the five directors with the most films in the dataset, and reveals distinct patterns in their box office performance. Steven Spielberg is almost always at the top, with notable peaks in the early 1980s and late 2000s. Despite some dips, he consistently maintains the highest average gross revenue. Clint Eastwood remains in second place until the early 2000s, though far behind Spielberg, and shows a notable peak in the early 2010s, reflecting a surge in his box office success during that period. Woody Allen stays relatively consistent but remains at the lower end of the top 5, with a small peak in the early 2010s that coincides with a significant dip in revenue for both Spielberg and Scorsese.

Martin Scorsese stays in the middle of the pack for most of the timeline, with noticeable peaks in the mid- and late 2000s, reflecting strong performances with films like The Departed and The Aviator. Spike Lee shows a more erratic pattern, with earnings fluctuating over the years and a peak in the mid-2000s. Overall, Spielberg’s dominance stands out, while the others have more variable success, with some periods of growth and others of decline.

df_cleaned = df[df.director_name != 'NA']

top_directors = df_cleaned.director_name.value_counts().head(5).index

df_top_directors = df_cleaned[df_cleaned.director_name.isin(top_directors)]

director_data = df_top_directors.groupby(['director_name', 'title_year'])['gross'].mean().reset_index()

plt.figure(figsize=(18, 10))

for director in top_directors:
    director_data_filtered = director_data[director_data['director_name'] == director]
    plt.plot(director_data_filtered['title_year'], director_data_filtered['gross'], label=f'{director}', marker='o')

plt.title('Top 5 Directors: Average Gross Revenue Over Time', fontsize=18)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average Gross Revenue ($)', fontsize=14)

plt.xlim(left=1965)  # start the x-axis at 1965; the right limit stays automatic
plt.legend(loc='upper left', fontsize=12)
plt.show()
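
How the “top 5” is chosen affects which directors appear in the chart. The sketch below uses an invented `toy` frame with hypothetical directors A, B, and C to contrast ranking by number of films (the `value_counts` approach used above) with ranking by mean gross. The second ranking is shown only for comparison and is not part of the analysis.

```python
import pandas as pd

# Hypothetical toy data: director A is prolific, director C has one big hit.
toy = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B", "C"],
    "gross": [10.0, 20.0, 30.0, 200.0, 100.0, 500.0],
})

# Ranking by number of films: value_counts sorts from most to fewest titles.
by_count = toy["director_name"].value_counts().head(2).index.tolist()

# Alternative ranking by average gross per film (for comparison only).
by_mean_gross = (
    toy.groupby("director_name")["gross"]
    .mean()
    .sort_values(ascending=False)
    .head(2)
    .index.tolist()
)

print(by_count)
print(by_mean_gross)
```

The two rankings select different directors in this toy case, which is worth keeping in mind when reading “top” in the chart title.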

Gross Revenue Distribution by Movie Duration

The donut chart visualizes the distribution of gross revenue across movie duration categories. It shows that shorter and mid-length films dominate in terms of revenue, while the longer duration categories contribute comparatively little. The 90-120 minutes category contributes the largest share of gross revenue (51.65% or $121.4 billion), which makes sense given that this duration range is likely to include a large portion of popular films. The 120-150 minutes category follows, contributing 24.76% or $58.2 billion, likely with fewer films than the shorter ranges. These totals include the mean-imputed gross values described in the Dataset section.

The 0-90 minutes category accounts for 15.06% of the total gross ($35.4 billion) and likely represents a smaller number of films, as shorter runtimes may be associated with niche genres or lower-budget productions. The 150-180 minutes and 180+ minutes categories make up much smaller portions of the total revenue, 6.17% and 2.36% respectively, and likely also represent fewer films overall, as longer runtimes are less common in mainstream cinema. This suggests that while longer films may generate substantial revenue when they perform well, their combined total is outpaced by the sheer volume of films in the 90-120 minutes range.

df['duration_category'] = pd.cut(df.duration, bins=[0, 90, 120, 150, 180, float('inf')], labels=['0-90 mins', '90-120 mins', '120-150 mins', '150-180 mins', '180+ mins'])
outside_color_ref_number = np.arange(5)*4

fig=plt.figure(figsize=(8,8))
ax = fig.add_subplot(1,1,1)
colormap=plt.get_cmap("tab20c")

outer_colors = colormap(outside_color_ref_number)

total_gross = df.gross.sum()

# Sum gross revenue per duration category and draw it as a pie on the existing axes
df.groupby('duration_category', observed=False)['gross'].sum().plot(
    kind='pie', ax=ax, radius=1, colors=outer_colors, pctdistance=0.80, labeldistance=1.01,
    wedgeprops=dict(edgecolor='w'), textprops={'fontsize': 14},
    autopct=lambda p: '{:.2f}%\n(${:.1f}B)'.format(p, (p / 100) * total_gross / 1e+9),
    startangle=90)

# A white circle in the center turns the pie into a donut
hole = plt.Circle((0, 0), 0.3, fc='white')
ax.add_artist(hole)
ax.yaxis.set_visible(False)

ax.text(0,0, 'Total Gross\n' +'$' +str(round(total_gross/1e+9, 2)) + 'B', ha='center', va='center', fontsize=14)
ax.axis('equal')
plt.title('Gross Revenue Distribution by Movie Duration', fontsize=18)
plt.tight_layout()


plt.show()
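
The donut chart reports revenue shares only, so statements about how many films sit in each category are inferences. One way to check them is to tabulate film counts next to the revenue totals, as in the sketch below. The `toy` frame and its values are invented for illustration, and only the duration bins match the analysis.

```python
import pandas as pd

# Hypothetical toy data (durations in minutes, gross in arbitrary units).
toy = pd.DataFrame({
    "duration": [85, 95, 100, 110, 130, 190],
    "gross": [10.0, 20.0, 30.0, 40.0, 60.0, 40.0],
})

# Same duration bins and labels as the donut chart.
toy["duration_category"] = pd.cut(
    toy["duration"],
    bins=[0, 90, 120, 150, 180, float("inf")],
    labels=["0-90 mins", "90-120 mins", "120-150 mins", "150-180 mins", "180+ mins"],
)

# Film count and total gross per category; observed=False keeps empty bins.
summary = toy.groupby("duration_category", observed=False)["gross"].agg(
    films="count", total_gross="sum"
)

# Share of overall gross, as a percentage rounded to two decimals.
summary["share_pct"] = (100 * summary["total_gross"] / summary["total_gross"].sum()).round(2)

print(summary)
```

Reading `films` and `share_pct` side by side separates a category that earns a large share because it contains many films from one that earns it with only a few.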

Heatmap of the Average Gross Revenue By IMDb Score Bin and Decade

The heatmap reveals a clear trend in which higher IMDb scores generally correlate with higher average gross revenue, particularly in the 8-9 and 9-10 score bins. The most notable peak in revenue occurs in the 1970s, particularly for films with IMDb scores of 9-10, such as Star Wars, which achieved both critical acclaim and substantial box office earnings. Across multiple decades, films with higher ratings, especially those in the 8-9 and 9-10 score ranges, demonstrate stronger commercial success. Low-rated films from the 1980s to 2010 struggled more at the box office than those from other periods. Overall, the data highlights the strong connection between film ratings and financial performance.

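# right=False makes each bin left-inclusive, so 1970 falls in '1970-1979' as the labels imply
df['year_bin'] = pd.cut(df.title_year, bins=[1960, 1970, 1980, 1990, 2000, 2010, 2020], right=False, labels=['1960-1969', '1970-1979', '1980-1989', '1990-1999', '2000-2009', '2010-2019'])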
df['score_bin'] =pd.cut(df['imdb_score'], bins=[0,1,2,3,4,5,6,7,8,9,10], labels=['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7', '7-8', '8-9', '9-10'])
df['gross_scaled'] = df.gross / 1e6

# Mean gross (in $M) per decade and score bin; observed=False keeps empty combinations
avg_gross_by_year_score = df.groupby(['year_bin', 'score_bin'], observed=False)['gross_scaled'].mean().unstack()
fig = plt.figure(figsize=(18,10))
ax=fig.add_subplot(1,1,1)

ax = sns.heatmap(avg_gross_by_year_score, linewidths=0.2, annot=True, cmap='coolwarm',fmt='.2f',
                 annot_kws={'size':11})
plt.title('Heatmap of the Average Gross Revenue by IMDB Score Bin and Decade (1960-2020)', fontsize=18, pad=15)
plt.xlabel('IMDB Score Range', fontsize=14, labelpad=10)
plt.ylabel('Decade', fontsize=14, labelpad=10)
plt.yticks(size=12)
plt.xticks(size=12)
ax.invert_yaxis()

cbar=ax.collections[0].colorbar
my_colorbar_ticks = [*range(0,140, 10)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels=['${:.0f}M'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)
cbar.set_label('Average Gross Revenue (Millions)', rotation = 270, fontsize=14, labelpad=15)
plt.show()
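
When years are binned into decades, it is worth checking which edge of each interval is inclusive, since that decides where a boundary year such as 1970 lands. The sketch below uses three hypothetical years and two of the decade bins to compare the default behavior of `pd.cut` with `right=False`.

```python
import pandas as pd

# Two decade bins and three test years, including the boundary year 1970.
edges = [1960, 1970, 1980]
labels = ["1960-1969", "1970-1979"]
years = pd.Series([1969, 1970, 1979])

# Default (right=True): intervals are (1960, 1970] and (1970, 1980].
right_closed = pd.cut(years, bins=edges, labels=labels)

# right=False: intervals are [1960, 1970) and [1970, 1980).
left_closed = pd.cut(years, bins=edges, labels=labels, right=False)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
```

With the default setting, 1970 is assigned to the bin labeled 1960-1969. With `right=False`, it is assigned to 1970-1979, which matches the labels.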

Conclusion

The analysis of the movie dataset provides valuable insights into how various factors, such as budget, IMDb score, director, and movie duration, relate to box office success. The scatter plot of budget versus IMDb score shows that movies with moderate budgets (between $10M and $40M) and average to good IMDb ratings (5-9) are the most common. Among the top-grossing films, higher budgets generally correlate with higher earnings, as with Avatar, but films like A New Hope, with a modest budget of $11M, show that strong storytelling and a strong fan base can also lead to exceptional financial success.

Directors like Steven Spielberg consistently produce films with high box office returns, although his returns also vary considerably over time. This suggests that while certain directors are more successful overall, there is still a wide range of performance within the industry, often driven by the quality of the films rather than the director’s name alone.

The donut chart and heatmap reveal further nuances in box office performance. The donut chart illustrates that films running between 90 and 120 minutes account for the largest share of gross earnings, while the longest categories contribute only a small share. The heatmap reinforces the importance of IMDb ratings, showing that higher-rated films tend to perform better in terms of gross revenue. Overall, successful films often combine moderate budgets and solid ratings, with directors and production choices also playing significant roles in their success.