Decoding Cinematic Success: A Data Analysis of Over a Million Movies

Author

Mallory Bowling

Introduction

As a movie fan and a business analytics student, I’ve long been curious about what makes a film succeed. Is it budget, timing, genre, or something else? In this project, I explore real-world data to try to answer those questions.

This project uses the “Full TMDb Movies Dataset 2024” from Kaggle, a collection of over 1 million movies sourced from The Movie Database (TMDb). TMDb is a comprehensive repository of movie information, ranging from fundamental characteristics (e.g., titles, release dates) to more nuanced attributes (e.g., ratings, revenue, genre). To add a focused lens on critically acclaimed movies, I’m also incorporating a secondary dataset of IMDb’s Top 250 movies.

To access the dataset on Kaggle, see: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data.

To access IMDb’s Top 250 Movies list, see: https://www.imdb.com/chart/top/

Line of Inquiry

What factors most significantly predict a movie’s success in terms of revenue and popularity?

I’m using these complementary datasets together to address several questions about the dynamics of the film industry. My goals are to:

Identify trends in movie release dates and their impact on generated revenue.
Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie’s success.
Explore the impact of movie genres on popularity and revenue.
Test whether runtime correlates with audience engagement.
Identify successful production companies.
Visualize movie popularity over time and identify popular genres in different periods.
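
As a starting point for the budget, revenue, and popularity goal, here is a minimal sketch of a Pearson correlation. The rows are made-up values, not TMDb data, and the `pearson` helper is a hypothetical name introduced for illustration. It also assumes that a budget or revenue of 0 means "not reported" and should be excluded, which would need to be checked against the real dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Made-up rows for illustration only (not actual TMDb values).
movies = [
    {"budget": 10, "revenue": 25, "popularity": 3.0},
    {"budget": 20, "revenue": 45, "popularity": 2.0},
    {"budget": 30, "revenue": 65, "popularity": 5.0},
    {"budget": 0, "revenue": 0, "popularity": 0.6},  # assumed "not reported"
]

# Assumption: zeros are placeholders for missing figures, so drop them.
usable = [m for m in movies if m["budget"] > 0 and m["revenue"] > 0]

budgets = [m["budget"] for m in usable]
r_budget_revenue = pearson(budgets, [m["revenue"] for m in usable])
r_budget_popularity = pearson(budgets, [m["popularity"] for m in usable])

print(f"budget vs revenue:    {r_budget_revenue:.2f}")
print(f"budget vs popularity: {r_budget_popularity:.2f}")
```

A correlation only describes how strongly two columns move together; it does not by itself show that a larger budget causes higher revenue.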

Data Dictionary

The data dictionary for the TMDb Million Movies CSV is as follows:

Column Name	Data Type	Definition
id	int	Unique identifier for each movie
title	str	Title of the movie
vote_average	float	Average vote or rating given by viewers
vote_count	int	Total count of votes received for the movie
status	str	The status of the movie (e.g., released, rumored, post-production, etc.)
release_date	str	Date the movie was released
revenue	int	Total revenue generated by the movie
runtime	int	Duration of the movie in minutes
adult	bool	Indicates if the movie is suitable for adult audiences only
backdrop_path	str	URL of the backdrop image for the movie
budget	int	Budget allocated for the movie
homepage	str	Official homepage URL of the movie
imdb_id	str	IMDb ID of the movie
original_language	str	Original language in which the movie was produced
original_title	str	Original title of the movie
overview	str	Brief description or summary of the movie
popularity	float	Popularity score of the movie
poster_path	str	URL of the movie poster image
tagline	str	Catchphrase or memorable line associated with the movie
genre	str	List of genres the movie belongs to
production_companies	str	List of production companies involved in the movie
production_countries	str	List of countries involved in the movie production
spoken_languages	str	List of languages spoken in the movie
keywords	str	Keywords associated with the movie

An initial look at the dataset shows substantial missing data. The columns with the most null values are imdb_id (49%), tagline (86%), genre (41%), production_companies (56%), production_countries (46%), spoken_languages (44%), and keywords (74%). The URL columns are also predominantly null, but I won’t be using them.
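
Percentages like these could be computed along the following lines. This is a sketch under stated assumptions, not the code used in this project: it treats an empty CSV field as null, the sample rows are made up, and `null_percentages` is a hypothetical helper name.

```python
import csv
import io


def null_percentages(rows, columns):
    """Percentage of rows in which each column is empty, highest first."""
    total = len(rows)
    shares = {}
    for col in columns:
        # Assumption: an empty or whitespace-only field counts as null.
        missing = sum(1 for row in rows if not (row.get(col) or "").strip())
        shares[col] = round(100 * missing / total, 1) if total else 0.0
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Made-up sample standing in for the real CSV (not actual TMDb data).
sample_csv = io.StringIO(
    "title,tagline,keywords\n"
    "Movie A,,space\n"
    "Movie B,,\n"
    "Movie C,,\n"
    "Movie D,A catchy line,heist\n"
)

reader = csv.DictReader(sample_csv)
rows = list(reader)
shares = null_percentages(rows, reader.fieldnames)

for column, pct in shares.items():
    print(f"{column}: {pct}%")
```

For the real file, the `io.StringIO` sample would be replaced by the opened CSV. A dataframe library would do the same job in fewer lines; the standard library is used here only to keep the sketch dependency-free.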

Analysis



Average popularity for all movies in the dataset is 1.17.
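
An average like this can be reproduced with a short calculation. The sketch below uses made-up rows rather than the real dataset, assumes blank popularity values should be skipped, and introduces `mean_popularity` as a hypothetical helper name.

```python
import csv
import io
from statistics import fmean


def mean_popularity(rows):
    """Mean of the popularity column, skipping blank or non-numeric values."""
    values = []
    for row in rows:
        try:
            values.append(float(row["popularity"]))
        except (KeyError, TypeError, ValueError):
            continue  # assumption: unusable values are ignored, not counted as 0
    return round(fmean(values), 2) if values else None


# Made-up sample rows (not actual TMDb values).
sample_csv = io.StringIO(
    "title,popularity\n"
    "Movie A,0.5\n"
    "Movie B,1.5\n"
    "Movie C,\n"
    "Movie D,2.5\n"
)

rows = list(csv.DictReader(sample_csv))
avg = mean_popularity(rows)
print(f"Average popularity: {avg}")
```

Whether missing values are skipped or counted as zero changes the result, so the choice is worth stating alongside any reported average.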
