Python Project

DATASET SUMMARY

My dataset has information on the top 1,000 highest-grossing Hollywood films. It includes financial details such as budget, domestic and international sales, and worldwide revenue, which allow for more in-depth analysis. Each movie also has a title, distributor, release date, genre, and running time, which lets me identify trends in the film industry. The dataset also has licensing information and domestic opening revenue, which help show how much success a movie had initially compared with what it made in the long term.
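As a rough illustration of the opening-versus-long-term comparison mentioned above, the sketch below computes what share of a film's domestic total came from its opening. The rows are made-up toy values, not figures from the dataset, and the column name "Domestic Opening (in $)" is an assumption about how the CSV labels that field.

```python
import pandas as pd

# Toy rows standing in for the real CSV; all figures are invented for illustration.
# "Domestic Opening (in $)" is an assumed column name.
sample = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Domestic Opening (in $)": [50_000_000, 20_000_000, None],
    "Domestic Sales (in $)": [200_000_000, 40_000_000, 90_000_000],
})

# Share of the domestic total earned at opening (NaN where the opening is missing).
sample["Opening Share"] = (
    sample["Domestic Opening (in $)"] / sample["Domestic Sales (in $)"]
)

print(sample[["Title", "Opening Share"]])
```

A high share would suggest a front-loaded release, while a low share would suggest a film that kept earning long after its opening.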

VISUALIZATION 1

# VISUALIZATION 1
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv(r"E:\Downloads\movies.csv") 

df["World Wide Sales (in $)"] = pd.to_numeric(df["World Wide Sales (in $)"], errors="coerce")

df_sorted = df.sort_values("World Wide Sales (in $)", ascending=False).head(5)

plt.figure(figsize=(10, 5))
sns.barplot(x="Title", y="World Wide Sales (in $)", hue="Title", data=df_sorted, legend=False, palette="viridis")

plt.xlabel("Movie Title")
plt.ylabel("Worldwide Sales ($)")
plt.title("Top 5 Movies by Worldwide Sales")
plt.xticks(rotation=45)

plt.show()

VISUALIZATION 1 SUMMARY

This visualization shows the top 5 highest-grossing movies by worldwide sales. By sorting the dataset in descending order, I am able to identify the films that made the most money globally. The bar chart compares their earnings, and giving each bar its own color makes the films easier to tell apart. This shows how financially successful each of Hollywood’s biggest blockbusters has been and how they compare with one another.

VISUALIZATION 2

# VISUALIZATION 2

import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
import ast  # To safely evaluate the string representation of lists

# Initialize an empty list to store genres
genre_list = []

# Loop through the 'Genre' column in your dataframe and parse the string representation of the genres
for genres in df["Genre"]:
    genre_list.extend(ast.literal_eval(genres))  # Safely parse the string as a list

# Count the occurrences of each genre
genre_counts = Counter(genre_list)

# Create the pie chart
plt.figure(figsize=(8, 8))
plt.pie(genre_counts.values(), labels=genre_counts.keys(), autopct="%1.1f%%", colors=sns.color_palette("pastel"))

# Add a title to the plot
plt.title("Movie Genre Distribution")
plt.show()

VISUALIZATION 2 SUMMARY

This visualization shows the distribution of movie genres in the dataset. By extracting the genres from each movie, I created a list that includes all the genres and then counted how many times each one appeared. The pie chart visualizes these counts, displaying the percentage of each genre’s presence in Hollywood’s highest-grossing films. The pastel color palette makes it easy to distinguish between genres, and the percentage labels add clarity to how each genre compares in terms of frequency.

When I connect this to the first chart, it adds context to the types of films that are driving box office success. While the first chart focused on individual movie performance, this chart reveals the broader genre trends that may explain why certain movies are more successful than others. It’s interesting to see which genres dominate the market and how that aligns with the biggest earners in Hollywood.
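The genre-counting step described above can also be sketched with pandas alone. This is a minimal alternative on toy data, assuming each Genre cell is a string that looks like a Python list; the sample rows and the `genre_share` name are illustrative only.

```python
import ast
import pandas as pd

# Toy data: each Genre cell is a string formatted like a Python list (assumed format).
sample = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Genre": ["['Action', 'Adventure']", "['Action', 'Sci-Fi']", "['Drama']", None],
})

genre_counts = (
    sample["Genre"]
    .dropna()                  # skip rows with no genre string
    .apply(ast.literal_eval)   # "['Action', 'Sci-Fi']" -> ['Action', 'Sci-Fi']
    .explode()                 # one row per (movie, genre) pair
    .value_counts()            # occurrences of each genre, largest first
)

# Percentage of all genre tags, which is what the pie chart labels display.
genre_share = genre_counts / genre_counts.sum() * 100
print(genre_share.round(1))
```

Because a movie can carry several genres, the percentages describe shares of genre tags, not shares of movies.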

VISUALIZATION 3

# VISUALIZATION 3

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"E:\Downloads\movies.csv") 


distributor_yearly_revenue = df.groupby(['Distributor', 'Year'])['World Wide Sales (in $)'].sum().reset_index()


top_distributors = distributor_yearly_revenue.groupby('Distributor')['World Wide Sales (in $)'].sum().nlargest(10).index
distributor_yearly_revenue = distributor_yearly_revenue[distributor_yearly_revenue['Distributor'].isin(top_distributors)]


distributor_yearly_revenue['Rank'] = distributor_yearly_revenue.groupby('Year')['World Wide Sales (in $)'].rank(ascending=False, method="first")


pivot_df = distributor_yearly_revenue.pivot(index='Year', columns='Distributor', values='Rank')


plt.figure(figsize=(12, 6))
sns.lineplot(data=pivot_df, dashes=False, palette="tab10")
plt.gca().invert_yaxis()  
plt.title("Ranking of Top 10 Distributors Over Time")
plt.xlabel("Year")
plt.ylabel("Rank")
plt.legend(title="Distributor", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

VISUALIZATION 3 SUMMARY

My third visualization shows how the rankings of the top 10 distributors in Hollywood have shifted over time, based on their worldwide sales. By grouping the data by distributor and year, I can track how each distributor performed throughout the years. What stood out to me is how some distributors have maintained their dominance, while others have fluctuated more frequently in the rankings. The line chart highlights these changes, with each distributor represented by a different color for easy comparison. Because rank 1 is the best position, I inverted the y-axis so that the leading distributors appear at the top of the chart.

When I connect this to the first two visualizations, it adds depth to the story. The first chart showed the top 5 highest-grossing movies, revealing the films that led in global sales. The second chart provided insight into genre distribution, showing which types of films tend to dominate the box office. This chart zooms out to focus on the distributors behind those films, revealing how these companies have shaped the global market over time.
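To make the per-year ranking step concrete, here is a small sketch on toy data. The distributor names and sales values are hypothetical; only the groupby, rank, and pivot calls mirror the approach used for the chart.

```python
import pandas as pd

# Toy yearly totals for three hypothetical distributors (values are made up).
yearly = pd.DataFrame({
    "Distributor": ["Studio A", "Studio B", "Studio C"] * 2,
    "Year": [2020, 2020, 2020, 2021, 2021, 2021],
    "World Wide Sales (in $)": [300, 500, 100, 400, 200, 600],
})

# Rank within each year: rank 1 goes to the largest total in that year.
yearly["Rank"] = (
    yearly.groupby("Year")["World Wide Sales (in $)"]
    .rank(ascending=False, method="first")
    .astype(int)
)

# One row per year and one column per distributor, ready for a line plot.
rank_table = yearly.pivot(index="Year", columns="Distributor", values="Rank")
print(rank_table)
```

With `method="first"`, ties are broken by row order, so every rank in a year is unique. A distributor with no films in a given year has no row for that year and shows up as a gap in the pivoted table.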

VISUALIZATION 4

import matplotlib.pyplot as plt
import numpy as np

# VISUALIZATION 4

movie = "Avatar"  
movie_data = df[df['Title'] == movie].iloc[0]

# Data for the waterfall chart
categories = ["Domestic Sales", "International Sales", "Total"]
values = [movie_data["Domestic Sales (in $)"], movie_data["International Sales (in $)"]]
total_value = sum(values)
values.append(total_value)

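# Positions for bars
x_pos = np.arange(len(categories))
# Domestic starts at 0, International starts where Domestic ends,
# and the Total bar starts again at 0 so it is not stacked on the others
y_values = [0, values[0], 0]

# Plot bars
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(x_pos, values, width=0.5, color=['blue', 'green', 'gray'], bottom=y_values)

# Labels and formatting: place each label just above the top of its bar
for i, v in enumerate(values):
    ax.text(x_pos[i], y_values[i] + v + total_value * 0.02, f"${v:,.0f}", ha='center', fontsize=10, fontweight='bold')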

ax.set_xticks(x_pos)
ax.set_xticklabels(categories)
ax.set_ylabel("Gross Earnings ($)")
ax.set_title(f"Revenue Breakdown of {movie}")

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

VISUALIZATION 4 SUMMARY

The chart breaks down Avatar’s revenue by domestic and international sales, showing how each contributes to its total gross earnings. The waterfall chart starts with domestic sales, adds the international sales on top, and finishes with the total earnings. Because the international bar starts where the domestic bar ends, the chart makes it easy to compare the two markets and see how they add up to the total. Looking back at the earlier charts, this breakdown puts Avatar’s place among the highest-grossing films in perspective.

While the first chart highlighted the biggest blockbusters, this one zooms in on the specifics of Avatar’s earnings. The genre distribution from the second chart gives us context on the types of films that perform well globally, which adds another layer to understanding how movies like Avatar dominate both domestic and international markets. It’s interesting to see the balance of domestic vs. international sales and how that might reflect trends in global audience preferences.
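The stacking logic behind a waterfall chart can be sketched as a small helper. The function name `waterfall_layout` and the sample numbers are hypothetical; the idea is only that each step starts at the running total of the steps before it, while the final total bar starts from zero.

```python
from itertools import accumulate

def waterfall_layout(increments):
    """Return (bottoms, heights) for a waterfall chart that ends in a total bar.

    Assumes `increments` is a non-empty list of numbers.
    """
    running = list(accumulate(increments))  # running total after each step
    bottoms = [0] + running[:-1]            # each step starts where the last one ended
    total = running[-1]
    # The final "Total" bar is drawn from zero, not stacked on the steps.
    return bottoms + [0], list(increments) + [total]

# Made-up example values standing in for domestic and international sales.
bottoms, heights = waterfall_layout([700, 2100])
print(bottoms, heights)
```

The two returned lists can be passed as the `bottom` and height arguments of `ax.bar`, which keeps the layout correct if more revenue categories are added later.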

VISUALIZATION 5

# VISUALIZATION 5
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


df = pd.read_csv(r"E:\Downloads\movies.csv")


top_movies = df.nlargest(15, 'World Wide Sales (in $)')


sns.set(style="whitegrid")


plt.figure(figsize=(10, 6))
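# Assign hue explicitly (as in Visualization 1) so the palette applies per title without a legend
sns.barplot(x='World Wide Sales (in $)', y='Title', hue='Title', data=top_movies, palette='viridis', legend=False)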


plt.title('Top 15 Movies by Worldwide Sales')
plt.xlabel('Worldwide Sales ($)')
plt.ylabel('Movie Title')


plt.show()

VISUALIZATION 5 SUMMARY

This horizontal bar plot ranks the top 15 highest-grossing movies by worldwide sales, with one bar per title.

Looking at the earlier visualizations, the top 5 movies by worldwide sales highlighted which films dominated the box office. This chart widens that view to the top 15, showing how the films just below the very top compare. The genre distribution from the second chart adds context, since it shows which genres are most common among the highest-grossing films. Read together, the genre trends and the top-performing films give a more complete picture of what drives Hollywood’s box office success.

FINAL SUMMARY

When I look at these visualizations together, they paint a bigger picture of what drives Hollywood’s success and how the industry’s financial powerhouses operate. The first chart showed the top 5 highest-grossing movies, putting the spotlight on the films that dominate the box office and make the most money worldwide. From there, the third chart on the distributors’ rankings over time helped me see how the companies behind those big films are positioned and how their dominance has shifted year after year. It’s interesting to see how certain distributors stay at the top, while others see more fluctuation in their rankings.

The genre distribution chart was also really eye-opening. It’s clear that some genres are more common in the top-grossing films, and this gives us a sense of what type of content is driving those big earnings. It adds another layer to the whole story, helping us understand why certain movies perform so well globally. Then, the waterfall chart for Avatar brought it all together by breaking down how a major hit like that made its money by splitting it between domestic and international sales. It was a small but important detail that shows how global the box office really is.

Finally, the bar plot of the top 15 films shows each movie title alongside its worldwide sales, making the link between the highest-grossing films and their financial success easy to see. It highlights how much revenue individual movies account for, and it is a straightforward way to understand which movies are making the biggest impact at the box office.

Altogether, these visualizations give a clearer view of what’s happening behind the scenes in Hollywood, from the movies themselves to the distributors and the genres that dominate. It’s a combination of all these factors that really helps explain why certain films are able to pull in such massive amounts of money. It’s like piecing together a puzzle of the financial landscape of the film industry.