Decoding Cinematic Success: A Data Analysis of Over a Million Movies

Author

Mallory Bowling

Introduction

As a movie fan and a business analytics student, I’ve long been curious about what makes a film succeed. Is it budget, timing, genre, or something else? With this project, I want to explore real-world data to try to answer those questions.

This project uses the “Full TMDb Movies Dataset 2024” from Kaggle, a collection of over 1 million movies sourced from The Movie Database (TMDb). TMDb is a comprehensive repository of movie information, ranging from fundamental characteristics (e.g., titles, release dates) to more nuanced attributes (e.g., ratings, revenue, genre). To enrich the analysis with a focused lens on critically acclaimed movies, I’m incorporating a secondary dataset of IMDb’s Top 250 movies.

To access the dataset on Kaggle, see: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies/data.

To access IMDb’s Top 250 Movies list, see: https://www.imdb.com/chart/top/

Line of Inquiry

What factors most significantly predict a movie’s success in terms of revenue and popularity?

I’m using these two complementary datasets together to address several questions about the dynamics of the film industry. My goals are to:

  1. Identify trends in movie release dates and their impact on generated revenue.
  2. Analyze the relationship between budget, revenue, and popularity to determine factors that contribute to a movie’s success.
  3. Explore the impact of movie genres on popularity and revenue.
  4. Test whether runtime correlates with audience engagement.
  5. Identify successful production companies.
  6. Visualize movie popularity over time and identify popular genres in different periods.
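
As a starting point for goals 2 and 4, the pairwise relationships between the numeric columns can be summarized with a correlation matrix. The following is a minimal sketch with pandas, not a finished analysis. It assumes that a budget or revenue of 0 means "not reported" and filters those rows out; the function name `success_correlations` and the sample values are hypothetical.

```python
import pandas as pd

# Numeric columns from the data dictionary that relate to "success".
NUMERIC_COLS = ["budget", "revenue", "popularity", "runtime"]


def success_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations between budget, revenue, popularity, and runtime."""
    # Assumption: 0 is a placeholder for a missing budget or revenue figure,
    # so those rows are dropped before computing correlations.
    reported = df[(df["budget"] > 0) & (df["revenue"] > 0)]
    return reported[NUMERIC_COLS].corr(method="pearson")


# Tiny made-up sample; the last row has no reported budget and is excluded.
sample = pd.DataFrame({
    "budget":     [10, 20, 30, 40, 0],
    "revenue":    [25, 45, 65, 85, 500],
    "popularity": [1.0, 3.0, 2.0, 4.0, 9.0],
    "runtime":    [90, 100, 120, 95, 60],
})

corr = success_correlations(sample)
print(corr.round(2))
```

Correlation only describes linear association, so a scatter plot of each pair is a sensible follow-up before drawing conclusions about what drives revenue.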

Data Dictionary

The data dictionary for the TMDb Million Movies CSV is as follows:

Column Name           Data Type  Definition
id                    int        Unique identifier for each movie
title                 str        Title of the movie
vote_average          float      Average vote or rating given by viewers
vote_count            int        Total count of votes received for the movie
status                str        The status of the movie (e.g., released, rumored, post-production, etc.)
release_date          str        Date the movie was released
revenue               int        Total revenue generated by the movie
runtime               int        Duration of the movie in minutes
adult                 bool       Indicates if the movie is suitable for adult audiences only
backdrop_path         str        URL of the backdrop image for the movie
budget                int        Budget allocated for the movie
homepage              str        Official homepage URL of the movie
imdb_id               str        IMDb ID of the movie
original_language     str        Original language in which the movie was produced
original_title        str        Original title of the movie
overview              str        Brief description or summary of the movie
popularity            float      Popularity score of the movie
poster_path           str        URL of the movie poster image
tagline               str        Catchphrase or memorable line associated with the movie
genre                 str        List of genres the movie belongs to
production_companies  str        List of production companies involved in the movie
production_countries  str        List of countries involved in the movie production
spoken_languages      str        List of languages spoken in the movie
keywords              str        Keywords associated with the movie
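
Because release_date and the list-like columns are stored as strings, they usually need converting before any time-based or genre-based analysis. The sketch below shows one way to do that with pandas. It rests on two assumptions that should be checked against the actual file: that dates use the YYYY-MM-DD format and that list values are separated by a comma and a space. The helper name `prepare_columns` and the sample rows are hypothetical.

```python
import pandas as pd


def prepare_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with parsed dates, a release year, and genres as lists."""
    out = df.copy()
    # Assumption: dates look like "2010-07-16"; anything else becomes NaT.
    out["release_date"] = pd.to_datetime(
        out["release_date"], format="%Y-%m-%d", errors="coerce"
    )
    out["release_year"] = out["release_date"].dt.year
    # Assumption: genres are stored as "Action, Science Fiction".
    out["genre_list"] = out["genre"].str.split(", ")
    return out


# Two made-up rows: one clean, one with an unparseable date and no genre.
sample = pd.DataFrame({
    "title": ["A", "B"],
    "release_date": ["2010-07-16", "not a date"],
    "genre": ["Action, Science Fiction", None],
})

prepared = prepare_columns(sample)
print(prepared[["title", "release_date", "release_year", "genre_list"]])
```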

An initial look at the dataset shows a substantial amount of missing data. The columns with the most null values are imdb_id (49%), tagline (86%), genre (41%), production_companies (56%), production_countries (46%), spoken_languages (44%), and keywords (74%). The URL columns are also predominantly null, but I won’t be using them.
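
Null percentages like these can be computed per column in a few lines of pandas. The following is a minimal sketch on a tiny made-up frame, so its output does not reproduce the figures above. The function name `null_percentages` and the file name in the comment are hypothetical.

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of null values in each column, highest first."""
    # isna() gives a True/False frame; the mean of booleans is the null share.
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Stand-in data. With the real file this would be something like
# df = pd.read_csv("tmdb_movies.csv")  # hypothetical file name
sample = pd.DataFrame({
    "title":    ["A", "B", "C", "D"],
    "tagline":  [None, None, None, "Tag"],
    "keywords": [None, "x", None, "y"],
})

null_report = null_percentages(sample)
print(null_report)
```

Note that this only counts true nulls. Empty strings or zero placeholders would need a separate check.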

Analysis

Average popularity for all movies in the dataset is 1.17.
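
A summary figure like this comes from the mean of the popularity column. The sketch below shows the calculation on made-up values, so it does not reproduce the 1.17 result. The helper name `average_popularity` is hypothetical, and it relies on the pandas default of skipping null values when averaging.

```python
import pandas as pd


def average_popularity(df: pd.DataFrame, column: str = "popularity") -> float:
    """Mean of the given column, ignoring nulls, rounded to two decimals."""
    return round(float(df[column].mean()), 2)


# Made-up scores; the missing value is skipped by mean().
sample = pd.DataFrame({"popularity": [0.6, 1.4, None, 2.5]})

avg = average_popularity(sample)
print(f"Average popularity: {avg}")
```

Since a mean can be pulled around by a small number of extreme values, comparing it with `df["popularity"].median()` is a quick way to see how skewed the scores are.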
