Decoding Cinematic Success: A Data Analysis of Over a Million Movies
Introduction
As a movie fan and a business analytics student, I’ve long been curious about what makes a film succeed. Is it budget, timing, genre, or something else? With this project, I want to explore real-world data to try to answer those questions.
This project uses the “Full TMDb Movies Dataset 2024” from Kaggle, a collection of over 1 million movies sourced from The Movie Database (TMDb). TMDb maintains a comprehensive repository of movie information, ranging from fundamental characteristics (e.g., titles, release dates) to more nuanced attributes (e.g., ratings, revenue, genre). To enrich the analysis with a focused lens on critically acclaimed movies, I’m incorporating a secondary dataset of IMDb’s Top 250 movies.
To access the dataset on Kaggle, see: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data.
To access IMDb’s Top 250 Movies list, see: https://www.imdb.com/chart/top/
Line of Inquiry
What factors most significantly predict a movie’s success in terms of revenue and popularity?
I’m using these complementary datasets together to address several questions about the dynamics of the film industry. My goals are to:
- Identify trends in movie release dates and their impact on generated revenue.
- Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie’s success.
- Explore the impact of movie genres on popularity and revenue.
- Test whether runtime correlates with audience engagement.
- Identify successful production companies.
- Visualize movie popularity over time and identify popular genres in different periods.
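As a starting point for the budget and revenue question, here is a minimal sketch of a Pearson correlation written with only Python's standard library. The `movies` records are made-up illustrative values, not rows from the dataset, and treating a budget or revenue of 0 as "not reported" is my assumption rather than something the dataset documents.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical records (illustrative values only).
movies = [
    {"budget": 10, "revenue": 30, "popularity": 2.0},
    {"budget": 0, "revenue": 5, "popularity": 0.6},
    {"budget": 50, "revenue": 120, "popularity": 8.5},
    {"budget": 20, "revenue": 0, "popularity": 1.1},
    {"budget": 80, "revenue": 150, "popularity": 9.0},
]

# Assumption: a 0 in budget or revenue means the figure was not reported,
# so those rows are dropped before measuring the relationship.
usable = [m for m in movies if m["budget"] > 0 and m["revenue"] > 0]

r_budget_revenue = pearson(
    [m["budget"] for m in usable],
    [m["revenue"] for m in usable],
)
print(f"budget vs. revenue: r = {r_budget_revenue:.2f} (n = {len(usable)})")
```

Pearson's r only captures linear association, so for heavily skewed money columns it may be worth comparing it against a log-transformed or rank-based version.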
Data Dictionary
The data dictionary for the TMDb Million Movies CSV is as follows:
Column Name | Data Type | Definition |
---|---|---|
id | int | Unique identifier for each movie |
title | str | Title of the movie |
vote_average | float | Average vote or rating given by viewers |
vote_count | int | Total count of votes received for the movie |
status | str | The status of the movie (e.g., released, rumored, post-production, etc.) |
release_date | str | Date the movie was released |
revenue | int | Total revenue generated by the movie |
runtime | int | Duration of the movie in minutes |
adult | bool | Indicates if the movie is suitable for adult audiences only |
backdrop_path | str | URL of the backdrop image for the movie |
budget | int | Budget allocated for the movie |
homepage | str | Official homepage URL of the movie |
imdb_id | str | IMDb ID of the movie |
original_language | str | Original language in which the movie was produced |
original_title | str | Original title of the movie |
overview | str | Brief description or summary of the movie |
popularity | float | Popularity score of the movie |
poster_path | str | URL of the movie poster image |
tagline | str | Catchphrase or memorable line associated with the movie |
genre | str | List of genres the movie belongs to |
production_companies | str | List of production companies involved in the movie |
production_countries | str | List of countries involved in the movie production |
spoken_languages | str | List of languages spoken in the movie |
keywords | str | Keywords associated with the movie |
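Since a CSV reader returns every field as text, the types in this dictionary have to be applied after loading. The sketch below shows one way that could look with Python's standard library. The `CONVERTERS` mapping and the `coerce_row` helper are hypothetical names of my own, and two things are assumptions about the file rather than confirmed facts: that `release_date` uses the ISO `YYYY-MM-DD` format and that `adult` is stored as the text `True` or `False`.

```python
from datetime import date, datetime

def parse_bool(text):
    # Assumption: booleans are stored as the text "True" / "False".
    return text.strip().lower() == "true"

def parse_date(text):
    # Assumption: dates are ISO formatted, e.g. "2020-01-31".
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()

# Numeric, boolean, and date columns from the data dictionary.
# Every other column is left as a string.
CONVERTERS = {
    "id": int,
    "vote_average": float,
    "vote_count": int,
    "release_date": parse_date,
    "revenue": int,
    "runtime": int,
    "adult": parse_bool,
    "budget": int,
    "popularity": float,
}

def coerce_row(row):
    """Return a copy of a CSV row (dict of strings) with typed values.

    Blank or unparseable values become None instead of raising.
    """
    typed = dict(row)
    for column, convert in CONVERTERS.items():
        raw = row.get(column)
        if raw is None or raw.strip() == "":
            typed[column] = None
            continue
        try:
            typed[column] = convert(raw)
        except ValueError:
            typed[column] = None
    return typed

# Made-up example row, not taken from the dataset.
example = coerce_row({
    "id": "1",
    "title": "Example Movie",
    "vote_average": "7.5",
    "release_date": "2020-01-31",
    "adult": "False",
    "budget": "",
})
print(example["id"], example["release_date"], example["adult"], example["budget"])
```

Mapping failures to `None` keeps one malformed value from stopping a pass over a million rows, at the cost of hiding parse errors. Counting how many values fall back to `None` would be a sensible check.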
An initial look at the dataset shows that it contains a substantial amount of missing data. The columns with the most null values are imdb_id (49%), tagline (86%), genre (41%), production_companies (56%), production_countries (46%), spoken_languages (44%), and keywords (74%). The URL columns are also predominantly null, but I won’t be using them.
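Null percentages like these can be computed in a single pass over the file. The following is a minimal sketch using Python's `csv` module. `SAMPLE_CSV` is a tiny made-up stand-in for the real file, and treating an empty string as null is an assumption about how missing values are encoded.

```python
import csv
import io

# Tiny made-up sample standing in for the real CSV (illustrative values only).
SAMPLE_CSV = """id,title,tagline,keywords
1,Movie A,A tagline,space
2,Movie B,,
3,Movie C,,heist
4,Movie D,,
"""

def null_percentages(rows, columns):
    """Return {column: percent of rows whose value is missing or blank}."""
    total = 0
    missing = {col: 0 for col in columns}
    for row in rows:
        total += 1
        for col in columns:
            value = row.get(col)
            # Assumption: an empty field means the value is null.
            if value is None or value.strip() == "":
                missing[col] += 1
    if total == 0:
        return {col: 0.0 for col in columns}
    return {col: 100.0 * count / total for col, count in missing.items()}

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
result = null_percentages(reader, reader.fieldnames)

# List columns from most to least incomplete.
for col, pct in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {pct:.0f}%")
```

To run this against the actual dataset, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by a file opened with `newline=""` and an explicit encoding.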
Analysis
Average popularity for all movies in the dataset is 1.17.
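A figure like this is a plain mean over the popularity column. Here is a minimal sketch of that calculation. The `mean_popularity` helper and the sample rows are hypothetical, and skipping blank or non-numeric entries is my assumption about how such values should be handled.

```python
import statistics

def mean_popularity(rows):
    """Average the 'popularity' field over rows, skipping unusable values."""
    values = []
    for row in rows:
        raw = (row.get("popularity") or "").strip()
        if not raw:
            continue  # blank value: excluded from the average
        try:
            values.append(float(raw))
        except ValueError:
            continue  # non-numeric value: excluded from the average
    return statistics.mean(values) if values else None

# Made-up rows in the shape csv.DictReader would produce.
sample_rows = [
    {"title": "Movie A", "popularity": "0.5"},
    {"title": "Movie B", "popularity": "1.5"},
    {"title": "Movie C", "popularity": ""},
    {"title": "Movie D", "popularity": "2.5"},
]

average = mean_popularity(sample_rows)
print(f"Average popularity: {average:.2f}")
```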