Decoding Cinematic Success: A Data Analysis of Over a Million Movies
Introduction
As a movie fan and a business analytics student, I’ve long been curious about what makes a film succeed. Is it budget, timing, genre, or something else? With this project, I explore real-world data to try to answer those questions.
This project uses the “Full TMDb Movies Dataset 2024” from Kaggle, a collection of over 1 million movies sourced from The Movie Database (TMDb). TMDb maintains a comprehensive repository of movie information, ranging from fundamental characteristics (e.g., titles, release dates) to more nuanced attributes (e.g., ratings, revenue, genre). To enrich the analysis with a focused lens on critically acclaimed movies, I’m incorporating a secondary dataset of IMDb’s Top 250 movies.
To access the dataset on Kaggle, see: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data.
To access IMDb’s Top 250 Movies list, see: https://www.imdb.com/chart/top/
Line of Inquiry
What factors most significantly predict a movie’s success in terms of revenue and popularity?
I’m using these two complementary datasets together to address several questions about the dynamics of the film industry. My goals are to:
- Identify trends in movie release dates and their impact on generated revenue.
- Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie’s success.
- Explore the impact of movie genres on popularity and revenue.
- Examine whether runtime correlates with audience engagement.
- Identify successful production companies.
- Visualize movie popularity over time and identify popular genres in different periods.
Data Dictionary
The data dictionary for the TMDb Million Movies CSV is as follows:
Column Name | Data Type | Definition |
---|---|---|
id | int | Unique identifier for each movie |
title | str | Title of the movie |
vote_average | float | Average vote or rating given by viewers |
vote_count | int | Total count of votes received for the movie |
status | str | The status of the movie (e.g., released, rumored, post-production, etc.) |
release_date | str | Date the movie was released |
revenue | int | Total revenue generated by the movie |
runtime | int | Duration of the movie in minutes |
adult | bool | Indicates if the movie is suitable for adult audiences only |
backdrop_path | str | URL of the backdrop image for the movie |
budget | int | Budget allocated for the movie |
homepage | str | Official homepage URL of the movie |
imdb_id | str | IMDb ID of the movie |
original_language | str | Original language in which the movie was produced |
original_title | str | Original title of the movie |
overview | str | Brief description or summary of the movie |
popularity | float | Popularity score of the movie |
poster_path | str | URL of the movie poster image |
tagline | str | Catchphrase or memorable line associated with the movie |
genre | str | List of genres the movie belongs to |
production_companies | str | List of production companies involved in the movie |
production_countries | str | List of countries involved in the movie production |
spoken_languages | str | List of languages spoken in the movie |
keywords | str | Keywords associated with the movie |
An initial look at the dataset shows that it contains a substantial amount of missing data. The columns with the most null values are imdb_id (49%), tagline (86%), genre (41%), production_companies (56%), production_countries (46%), spoken_languages (44%), and keywords (74%). The URL columns are also predominantly null, but I won’t be using them.
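The null percentages above can be reproduced with a simple per-column count. The following is a minimal Python sketch, not the original analysis code: the helper name `null_share` is hypothetical, the rows are made up, and treating empty strings as missing is an assumption.

```python
def null_share(rows, columns):
    """Return {column: percentage of rows where the value is missing}."""
    total = len(rows)
    if total == 0:
        return {c: 0.0 for c in columns}
    shares = {}
    for c in columns:
        # Assumption: both None and the empty string count as missing.
        missing = sum(1 for r in rows if r.get(c) in (None, ""))
        shares[c] = missing / total * 100
    return shares

# Made-up rows for illustration only.
sample_rows = [
    {"title": "A", "tagline": None},
    {"title": "B", "tagline": ""},
    {"title": "C", "tagline": "Tagline"},
    {"title": "D", "tagline": None},
]
shares = null_share(sample_rows, ["title", "tagline"])
print(shares)
```

On the real CSV, the same function could be applied to rows read with `csv.DictReader`, where missing fields usually arrive as empty strings.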
TMDb Analysis
To better understand patterns in movie revenue, I created a visualization that looks at how the average film revenues vary by release month. This provides insight into potential seasonal trends and strategic release timing in the industry.
The bar chart above displays the average movie revenue for each month, calculated from a filtered dataset that includes only movies with valid release dates and non-zero revenues.
June, May, and July stand out as the highest-earning months, with average revenues exceeding $60 million in June. This suggests that summer blockbusters drive significant box office returns.
December also performs strongly, likely due to holiday releases, which often attract large audiences.
Conversely, January, September, and October show lower average revenues, indicating these months may be less favorable for major releases, possibly due to lower audience turnout or fewer high-budget films.
The trend suggests that studios strategically release major films during summer and holiday periods, aligning with school breaks and high consumer spending times.
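The monthly averages behind this chart come from a filter-then-group calculation. Here is a minimal Python sketch of that idea, under stated assumptions: the function name is hypothetical, the sample rows are invented, and `release_date` is assumed to be a `YYYY-MM-DD` string.

```python
from collections import defaultdict
from datetime import datetime

def average_revenue_by_month(rows):
    """Return {month number: mean revenue} for rows with a valid date and revenue > 0."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [revenue sum, film count]
    for row in rows:
        revenue = row.get("revenue") or 0
        try:
            # Assumption: dates are formatted as YYYY-MM-DD.
            month = datetime.strptime(row.get("release_date") or "", "%Y-%m-%d").month
        except ValueError:
            continue  # skip missing or malformed dates
        if revenue <= 0:
            continue  # skip zero or missing revenue
        totals[month][0] += revenue
        totals[month][1] += 1
    return {m: s / n for m, (s, n) in sorted(totals.items())}

# Made-up rows for illustration only.
sample_rows = [
    {"release_date": "2019-06-14", "revenue": 100},
    {"release_date": "2020-06-01", "revenue": 300},
    {"release_date": "2018-01-05", "revenue": 50},
    {"release_date": "", "revenue": 999},          # no date: dropped
    {"release_date": "2021-12-25", "revenue": 0},  # zero revenue: dropped
]
monthly = average_revenue_by_month(sample_rows)
print(monthly)
```

Because the result is a mean, a handful of very high-grossing films can lift a month’s average considerably, which is worth keeping in mind when reading the summer peaks.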
Then, I created a scatterplot that examines the relationship between a movie’s budget and its revenue to see if higher investments tend to result in higher box office returns.
The scatter plot shows each movie as a green dot, with budget on the x-axis and revenue on the y-axis, including only entries with non-zero and non-missing values for both. A black linear trend line overlays the data to show the general relationship.
Key insights from the plot include:
- There is a positive linear trend, indicating that higher budgets are generally associated with higher revenues. This suggests that investment in production often pays off, at least in terms of gross earnings.
- However, the spread is wide, especially at lower budget levels, where some low-budget films achieve high revenues, highlighting the existence of low-budget breakout successes.
- A few extreme outliers (e.g., budgets or revenues exceeding hundreds of millions or even billions) stretch the plot scale, potentially masking mid-range patterns.
- The dense clustering near the lower end of the budget axis suggests most films operate within a relatively modest budget range.
Overall, while bigger budgets are often tied to higher box office performance, this plot also hints at risk and variability. Spending more doesn’t guarantee massive success.
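The linear trend line in the scatter plot is an ordinary least-squares fit of revenue on budget. The sketch below shows that fit in plain Python. It is illustrative only: `fit_line` is a hypothetical helper and the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up (budget, revenue) pairs for illustration only.
films = [(10, 20), (20, 50), (30, 60), (0, 40), (15, 0)]

# Keep only entries where both budget and revenue are non-zero,
# mirroring the filter described for the plot.
valid = [(b, r) for b, r in films if b > 0 and r > 0]
budgets = [b for b, _ in valid]
revenues = [r for _, r in valid]

slope, intercept = fit_line(budgets, revenues)
print(f"revenue = {slope:.2f} * budget + {intercept:.2f}")
```

A least-squares line is sensitive to extreme values, so the outliers noted above can pull the slope noticeably. Fitting on log-transformed values is one common alternative, though the original plot does not state that this was done.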
To better understand how movie genres resonate with audiences, I analyzed the average popularity across genres, identifying which types of movies attract more attention according to TMDb’s popularity metric.
This horizontal bar chart ranks the top 10 most popular genres, based on the average popularity of movies within each genre. The dataset was transformed to handle multiple genres per movie and filtered to include only genres with more than 100 films for statistical reliability.
Adventure, Action, and Thriller emerge as the most popular genres, suggesting that high-energy, plot-driven films tend to garner more attention from viewers.
Other strong performers include Science Fiction and Fantasy, which may benefit from strong fan bases and big-budget productions that drive visibility.
Genres like Romance and War round out the list but appear slightly less popular on average.
However, it’s important to interpret these results cautiously. The TMDb popularity metric has a mean of only 1.17 and a very high standard deviation of 7.33, which implies that a small number of extremely popular titles can skew the averages significantly. Some genres may therefore rank higher not because of consistent popularity but because of a few standout titles.
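The genre ranking involves two steps: splitting each movie’s genre list so the movie counts once per genre, then averaging popularity for genres above a minimum film count. The Python sketch below illustrates both steps and reports the median alongside the mean, to show how a single standout title can skew a genre’s average. It assumes the `genre` column is a comma-separated string; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median

def genre_popularity(rows, min_films=100):
    """Return per-genre count, mean, and median popularity for genres with more than min_films films."""
    by_genre = defaultdict(list)
    for row in rows:
        # Assumption: genres are stored as one comma-separated string.
        for name in (row.get("genre") or "").split(","):
            name = name.strip()
            if name:
                by_genre[name].append(row["popularity"])
    return {
        g: {"n": len(v), "mean": mean(v), "median": median(v)}
        for g, v in by_genre.items()
        if len(v) > min_films
    }

# Made-up rows for illustration only; a tiny threshold stands in for 100.
sample_rows = [
    {"genre": "Action, Adventure", "popularity": 100.0},  # one standout title
    {"genre": "Action", "popularity": 1.0},
    {"genre": "Action, Drama", "popularity": 1.0},
    {"genre": "Drama", "popularity": 2.0},
    {"genre": "Drama", "popularity": 3.0},
    {"genre": None, "popularity": 5.0},  # missing genre: ignored
]
stats = genre_popularity(sample_rows, min_films=2)
for genre, s in stats.items():
    print(genre, s)
```

In this toy data, Action has a mean of 34.0 but a median of 1.0, which is the kind of gap the caveat above warns about. Comparing medians is one way to check whether a genre’s ranking depends on a few outliers.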
To see the typical duration of films in the dataset and identify common runtime ranges, I made a histogram that shows the distribution of their lengths.
- There is a very high bar at the beginning of the x-axis (around 0-5 minutes), indicating a large number of movies with extremely short runtimes. Although I filtered for `runtime > 0`, many entries with minimal recorded runtimes remain.
- The most prominent peak centers around the 90-100 minute mark. A large portion of the movies in the filtered dataset have a typical feature film duration.
- A distinct secondary peak is noticeable around the 60-70 minute range. This suggests another common category of movie length, potentially including more independent films, documentaries, or older films that might have had shorter standard runtimes.
- Following the primary peak, the number of movies in each runtime bin generally decreases as the duration increases. This indicates that longer movies are less frequent.
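The histogram described above groups runtimes into fixed-width bins after dropping non-positive values. Below is a minimal Python sketch of that binning. The 10-minute bin width, the function name, and the sample runtimes are assumptions chosen for illustration; the original chart’s bin width is not stated.

```python
from collections import Counter

def runtime_histogram(runtimes, bin_width=10):
    """Count runtimes per bin, keyed by each bin's starting minute; skips missing and non-positive values."""
    counts = Counter()
    for r in runtimes:
        if r is None or r <= 0:
            continue  # mirrors the runtime > 0 filter
        counts[(r // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up runtimes for illustration only.
sample_runtimes = [0, 3, 4, 62, 65, 91, 95, 98, 99, 130, None]
hist = runtime_histogram(sample_runtimes)

# Simple text rendering of the bins.
for start, n in hist.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * n}")
```

Raising the filter threshold (for example, keeping only runtimes above some minimum length) would remove the spike of very short entries, at the cost of excluding genuine short films.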
The horizontal bar chart ranks the top 10 production companies by their total accumulated revenue. Companies are ordered from top to bottom by total revenue, apart from the null (“NA”) entry, which appears at the top.
The chart prominently features well-established major Hollywood studios such as Warner Bros. Pictures, Universal Pictures, and 20th Century Fox at the top, indicating their significant and consistent contribution to the overall box office revenue.
The presence of an “NA” category with a high revenue suggests a substantial portion of movies in the dataset have missing or unspecified production company information.
The total revenue among the top 10 production companies varies considerably, with the leading companies generating significantly more revenue than those at the bottom of the list.
The high revenue figures for companies like Walt Disney Pictures and Marvel Studios likely reflect the success of their large-scale franchises and blockbuster movies, which tend to generate substantial box office earnings.
The inclusion of New Line Cinema, while lower in the ranking, signifies that production companies with a focus on specific genres or independent films can also achieve significant financial success.
The presence of Metro-Goldwyn-Mayer (MGM), a historically significant studio, in the top 10 emphasizes the long-term impact and accumulated success of established production houses over the years.
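A ranking like this comes from summing revenue per production company, with movies that have no company recorded falling into an “NA” bucket. The Python sketch below shows one way to do that. It rests on two assumptions that may not match the original analysis: `production_companies` is a comma-separated string, and a co-produced film’s full revenue is credited to every listed company. The function name and the rows are invented.

```python
from collections import Counter

def top_companies(rows, n=10, missing_label="NA"):
    """Return the n companies with the highest summed revenue as (name, total) pairs."""
    totals = Counter()
    for row in rows:
        revenue = row.get("revenue") or 0
        raw = row.get("production_companies") or ""
        # Assumption: company names are separated by commas.
        names = [c.strip() for c in raw.split(",") if c.strip()] or [missing_label]
        for name in names:
            # Assumption: each listed company is credited with the full revenue.
            totals[name] += revenue
    return totals.most_common(n)

# Made-up rows for illustration only.
sample_rows = [
    {"production_companies": "Studio A, Studio B", "revenue": 100},
    {"production_companies": "Studio A", "revenue": 50},
    {"production_companies": None, "revenue": 500},
    {"production_companies": "", "revenue": 20},
]
ranking = top_companies(sample_rows, n=2)
named_only = [pair for pair in top_companies(sample_rows) if pair[0] != "NA"]
print(ranking)
print(named_only)
```

Filtering out the “NA” bucket before ranking, as in `named_only`, keeps the chart focused on identifiable studios, while reporting the bucket separately shows how much revenue lacks company information.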
Connecting the Trends
The success of summer and holiday releases (high revenue) often aligns with big-budget productions from major studios, frequently falling within the popular Adventure, Action, and Science Fiction genres.
The presence of successful low-budget films suggests that factors beyond pure financial investment, such as compelling storytelling or strong audience appeal, can also drive revenue, potentially explaining the success of films in various genres and from different production companies.
The bimodal distribution of runtimes might reflect different production strategies or target audiences for different types of films, with major blockbusters often aiming for standard feature length and other types of content finding success in shorter durations.
IMDb Top 250 Movies Analysis
To create a focused dataset of critically acclaimed movies for comparison against the broader film landscape, I scraped the IMDb Top 250 movies list. I used the rvest and polite packages to extract each movie’s title, release year, runtime, and rank from the page’s HTML elements.
Long movies have a higher median user rating than medium- and short-length movies. Short and medium movies have nearly identical median user ratings, but short movies have a higher upper quartile than medium-length ones. Overall, this suggests that a movie longer than 120 minutes is more likely to have a higher IMDb user rating.
The analysis of the IMDb Top 250 suggests that longer movies (> 2 hours) tend to receive higher median user ratings compared to shorter and medium-length films within this highly-rated subset. The runtime distribution in the main TMDb dataset shows a primary peak around the standard feature length (90-100 minutes) and a secondary peak for shorter films (60-70 minutes). Very long films are less frequent overall.
This contrast suggests that while the majority of movies produced fall within the standard or shorter runtime categories, critical acclaim (as represented by the IMDb Top 250 user ratings) leans towards longer films. This could indicate that longer runtimes allow for more complex storytelling, character development, and epic scope, which resonate with critical audiences. However, commercial success (as implied by the volume of films in the main dataset’s runtime distribution) is achieved across various lengths.
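The runtime comparison for the Top 250 amounts to bucketing each movie by length and taking the median rating per bucket. Here is a minimal Python sketch of that grouping. The 120-minute threshold for “long” comes from the text; the 90-minute boundary between short and medium is a hypothetical placeholder, since the original cutoff is not stated, and the ratings are invented.

```python
from collections import defaultdict
from statistics import median

def runtime_category(minutes, short_max=90, long_min=120):
    """Label a runtime as short, medium, or long."""
    if minutes > long_min:
        return "long"    # longer than 120 minutes, as in the text
    if minutes < short_max:
        return "short"   # assumption: under 90 minutes counts as short
    return "medium"

def median_rating_by_category(movies):
    """movies is an iterable of (runtime_minutes, rating) pairs."""
    groups = defaultdict(list)
    for minutes, rating in movies:
        groups[runtime_category(minutes)].append(rating)
    return {cat: median(ratings) for cat, ratings in groups.items()}

# Made-up (runtime, rating) pairs for illustration only.
sample_movies = [
    (80, 8.1), (85, 8.2), (88, 8.4),
    (95, 8.1), (100, 8.2), (115, 8.3),
    (150, 8.5), (175, 8.7), (201, 9.0),
]
medians = median_rating_by_category(sample_movies)
print(medians)
```

Because every film in the Top 250 is already highly rated, differences between these medians are small in absolute terms, so the category boundaries chosen can affect the comparison.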
Findings and Conclusion
This project delved into a comprehensive dataset of over one million movies from TMDb, supplemented by a focused analysis of the IMDb Top 250 critically acclaimed films, to uncover patterns and factors influencing cinematic success. Through various visualizations and statistical explorations, several key trends and conclusions emerged:
- Strategic release timing is crucial.
- Budget influences revenue but doesn’t guarantee it.
- Action-oriented genres dominate popularity.
- Major studios are revenue powerhouses.
- Longer runtimes correlate with higher ratings.
- Critical success spans genres.
- Budget is not the sole determinant of critical praise.
While commercial success is often driven by strategic release timing, budget investment, and catering to popular genres, critical acclaim appears to be more closely associated with longer runtimes and strong artistic merit across diverse genres, irrespective of budget constraints. The runtime analysis shows the prevalence of standard-length films in the overall market, while the IMDb Top 250 suggests a preference for more extended narratives among highly rated movies. This comparison highlights the distinct factors that contribute to a film’s financial performance versus its critical recognition. Further research could look more closely at the very short films and the “NA” production company category to gain a more complete understanding of the dataset.