Decoding Cinematic Success: A Data Analysis of Over a Million Movies
Introduction
As a movie fan and business analytics student, I’ve long been curious about what makes a film succeed. Is it budget, timing, genre, or something else? In this project, I explore real-world data to try to answer those questions.
This project uses the “Full TMDb Movies Dataset 2024” from Kaggle, a collection of over 1 million movies sourced from The Movie Database (TMDb). TMDb is a comprehensive repository of movie information, ranging from fundamental characteristics (e.g., titles, release dates) to more nuanced attributes (e.g., ratings, revenue, genre). To enrich the analysis with a focused lens on critically acclaimed movies, I’m incorporating a secondary dataset: IMDb’s Top 250 movies.
To access the dataset on Kaggle, see: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data
To access IMDb’s Top 250 Movies list, see: https://www.imdb.com/chart/top/
Line of Inquiry
What factors most significantly predict a movie’s success in terms of revenue and popularity?
I’m using these two complementary datasets to address several questions about the dynamics of the film industry. My goals are to:
- Identify trends in movie release dates and their impact on generated revenue.
- Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie’s success.
- Explore the impact of movie genres on popularity and revenue.
- Test whether runtime correlates with audience engagement.
- Identify successful production companies.
- Visualize movie popularity over time and identify popular genres in different periods.
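As a starting point for the budget, revenue, and popularity goal above, a minimal sketch might look like the following. It assumes pandas and the column names listed in the data dictionary. It also assumes, as my own working assumption rather than something the dataset documents here, that a budget or revenue of 0 means the figure was not reported. The function name `success_correlations` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def success_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between budget, revenue, and popularity."""
    cols = ["budget", "revenue", "popularity"]
    # Assumption: zeros stand for unreported figures, so those rows are dropped.
    reported = df[(df["budget"] > 0) & (df["revenue"] > 0)]
    return reported[cols].corr()


# Made-up rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "budget": [0, 10, 20, 30],
    "revenue": [5, 20, 40, 60],
    "popularity": [1.0, 3.0, 2.0, 1.0],
})
corr = success_correlations(toy)
print(corr)
```

Pearson correlation only captures linear association, so a rank-based alternative such as `corr(method="spearman")` may be worth comparing if the real columns turn out to be heavily skewed.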
Data Dictionary
The data dictionary for the TMDb Million Movies CSV is as follows:
| Column Name | Data Type | Definition |
|---|---|---|
| id | int | Unique identifier for each movie |
| title | str | Title of the movie |
| vote_average | float | Average vote or rating given by viewers |
| vote_count | int | Total count of votes received for the movie |
| status | str | The status of the movie (e.g., released, rumored, post-production, etc.) |
| release_date | str | Date the movie was released |
| revenue | int | Total revenue generated by the movie |
| runtime | int | Duration of the movie in minutes |
| adult | bool | Indicates if the movie is suitable for adult audiences only |
| backdrop_path | str | URL of the backdrop image for the movie |
| budget | int | Budget allocated for the movie |
| homepage | str | Official homepage URL of the movie |
| imdb_id | str | IMDb ID of the movie |
| original_language | str | Original language in which the movie was produced |
| original_title | str | Original title of the movie |
| overview | str | Brief description or summary of the movie |
| popularity | float | Popularity score of the movie |
| poster_path | str | URL of the movie poster image |
| tagline | str | Catchphrase or memorable line associated with the movie |
| genre | str | List of genres the movie belongs to |
| production_companies | str | List of production companies involved in the movie |
| production_countries | str | List of countries involved in the movie production |
| spoken_languages | str | List of languages spoken in the movie |
| keywords | str | Keywords associated with the movie |
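Because release_date and the list-like columns are stored as strings, some preparation is likely needed before analyzing them. The sketch below shows one possible approach with pandas. It assumes that genre values are comma-separated strings (an assumption about the file format, not something stated in the table), and the helper names `split_list` and `prepare_columns` are hypothetical.

```python
import pandas as pd


def split_list(value) -> list:
    """Split a comma-separated string into a clean list; missing values give []."""
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def prepare_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a parsed release date, a release year, and a genre list."""
    out = df.copy()
    # Unparseable or missing dates become NaT instead of raising an error.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    out["release_year"] = out["release_date"].dt.year
    # Assumption: genres are stored as one comma-separated string per movie.
    out["genre_list"] = out["genre"].apply(split_list)
    return out


# Made-up rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": ["2010-07-15", "1999-03-30", None],
    "genre": ["Action, Science Fiction", "Drama", None],
})
prepared = prepare_columns(toy)
print(prepared[["title", "release_year", "genre_list"]])
```

The same `split_list` helper could be applied to production_companies, production_countries, spoken_languages, and keywords if they follow the same comma-separated layout.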
An initial look at the dataset shows a substantial amount of missing data. The columns with the most null values are imdb_id (49%), tagline (86%), genre (41%), production_companies (56%), production_countries (46%), spoken_languages (44%), and keywords (74%). The URL columns are also predominantly null, but I won’t be using them.
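Null percentages like the ones above can be computed in a few lines of pandas. This is a minimal sketch of one way to do it, not necessarily how the figures above were produced. The function name `null_percentages` and the toy rows are hypothetical.

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, sorted from highest to lowest."""
    # isna() marks missing cells as True; the mean of booleans is the missing share.
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Made-up rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": [None, None, None, "Hi"],
    "keywords": ["x", None, None, "y"],
})
null_pct = null_percentages(toy)
print(null_pct)
```

Note that `isna()` only counts true missing values. Empty strings or placeholder text would need to be converted to missing values first if they appear in the file.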
Analysis
The average popularity score across all movies in the dataset is 1.17.
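An average like this can be computed directly from the popularity column. The sketch below assumes pandas and a column named popularity, as in the data dictionary. The function name `average_popularity` and the toy values are hypothetical and are not meant to reproduce the 1.17 figure.

```python
import pandas as pd


def average_popularity(df: pd.DataFrame) -> float:
    """Mean of the popularity column, skipping missing values, to two decimals."""
    # Series.mean() ignores missing values by default.
    return round(float(df["popularity"].mean()), 2)


# Made-up values for illustration only; these are not values from the dataset.
toy = pd.DataFrame({"popularity": [0.6, 1.4, 2.5, None]})
avg = average_popularity(toy)
print(avg)
```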