I have spent the last week or so exploring this dataset of movie information, which was extracted from The Movie DB (TMDB).

The data tables

Let’s start by taking a look at the structure of the main dataset:

## Observations: 45,466
## Variables: 25
## $ adult                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collec...
## $ budget                <dbl> 30000000, 65000000, 0, 16000000, 0, 6000...
## $ genres                <chr> "[{'id': 16, 'name': 'Animation'}, {'id'...
## $ homepage              <chr> "http://toystory.disney.com/toy-story", ...
## $ id                    <dbl> 862, 8844, 15602, 31357, 11862, 949, 118...
## $ imdb_id               <chr> "tt0114709", "tt0113497", "tt0113228", "...
## $ original_language     <chr> "en", "en", "en", "en", "en", "en", "en"...
## $ original_title        <chr> "Toy Story", "Jumanji", "Grumpier Old Me...
## $ overview              <chr> "Led by Woody, Andy's toys live happily ...
## $ popularity            <dbl> 21.946943, 17.015539, 11.712900, 3.85949...
## $ poster_path           <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vz...
## $ production_companies  <chr> "[{'name': 'Pixar Animation Studios', 'i...
## $ production_countries  <chr> "[{'iso_3166_1': 'US', 'name': 'United S...
## $ release_date          <date> 1995-10-30, 1995-12-15, 1995-12-22, 199...
## $ revenue               <dbl> 373554033, 262797249, 0, 81452156, 76578...
## $ runtime               <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 10...
## $ spoken_languages      <chr> "[{'iso_639_1': 'en', 'name': 'English'}...
## $ status                <chr> "Released", "Released", "Released", "Rel...
## $ tagline               <chr> NA, "Roll the dice and unleash the excit...
## $ title                 <chr> "Toy Story", "Jumanji", "Grumpier Old Me...
## $ video                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ vote_average          <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, ...
## $ vote_count            <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, ...
## $ year                  <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995...

A thing to note is that a number of the fields in the dataset contain JSON data, to allow the creator to capture multiple values within one CSV field. (It’s actually malformed JSON, and that was a headache and a half to deal with; I got the ‘genre’ one reading successfully, but eventually gave up on parsing any of the others.) Here’s a quick snapshot of the genre data table I extracted:

## Observations: 91,106
## Variables: 2
## $ id    <dbl> 862, 862, 862, 8844, 8844, 8844, 15602, 15602, 31357, 31...
## $ genre <chr> "Animation", "Comedy", "Family", "Adventure", "Fantasy",...

The id value is the movie ID from the main table; the same movie ID appears as many times as it had genres. If it didn’t have any genres, it will be missing from the table.
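For anyone wanting to try this in Python, here is a minimal sketch of the same kind of extraction. It rests on an assumption: the sample values above look like Python literal syntax (single-quoted keys) rather than JSON, so `ast.literal_eval` may be able to read them where a JSON parser cannot. The function name `parse_genres` and the sample rows are made up for illustration, and this is not necessarily how the genre table in this post was built.

```python
import ast

def parse_genres(movie_id, raw):
    """Turn one raw `genres` cell into a list of (movie_id, genre_name) rows.

    Assumes the cell holds a Python-literal list of dicts with a 'name' key.
    Anything that cannot be parsed yields no rows.
    """
    if not isinstance(raw, str):
        return []
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [
        (movie_id, item["name"])
        for item in items
        if isinstance(item, dict) and "name" in item
    ]

# Hypothetical sample rows (the ids are placeholders, not real TMDB values).
rows = [
    (1, "[{'id': 101, 'name': 'Animation'}, {'id': 102, 'name': 'Comedy'}]"),
    (2, "[]"),  # a movie with no genres produces no rows at all
]

genre_table = [pair for movie_id, raw in rows for pair in parse_genres(movie_id, raw)]
print(genre_table)
```

As in the table above, a movie appears once per genre, and a movie with an empty genre list simply drops out of the result.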

Spread of the data

The first thing I wanted to investigate was the spread of the data.

Years

Let’s start with years.

So, it looks like the data is fairly minimal in the early 1900s, gradually becoming more comprehensive as you get closer to the present day.

Another thing to note is that the data doesn’t just include released movies; there are ones that are in post-production, cancelled, or even just rumoured. Let’s check the volumes of those.

So, it looks like there aren’t many movies in any status other than “released” in the dataset, and what is there should probably be ignored as the data is likely to be incomplete.

Languages

TMDB is an English-language website, so I thought I should check whether the data contained movies in other languages at all, and if so, how many.

As is to be expected, the dataset is heavily biased towards English releases. There are a small number of films in other languages, but the primary focus is on movies of interest to English speakers. This is an important caveat to note on the limitations of the dataset.

Revenue, Budget, and Return on Investment

What types of movies are the biggest successes? Let’s take a look at the ones that make back many times their budget in revenue. For the purposes of the following charts, Return on Investment (ROI) is calculated as revenue / budget; obviously these numbers may not include additional costs such as advertising spend. The revenue field also appears to be the US box office; films that did well internationally may not be represented properly.
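The ROI calculation itself is a one-liner, but the zeros in the budget and revenue columns need some care. The sketch below treats a zero as "not recorded", which is my assumption rather than something the dataset documents; the function name `roi` is also just for illustration.

```python
def roi(revenue, budget):
    """Return revenue / budget, or None when either figure is missing or zero.

    Treating 0 as "unknown" is an assumption about how the dataset
    records missing budgets and revenues.
    """
    if not budget or not revenue:
        return None
    return revenue / budget

# First row of the main table above: budget 30,000,000 and revenue 373,554,033.
toy_story_roi = roi(373554033, 30000000)
print(round(toy_story_roi, 2))

# Third row of the main table: both figures are 0, so no ROI is computed.
print(roi(0, 0))
```

Returning `None` instead of dividing keeps films with an unknown budget out of the rankings, rather than letting them raise an error or show up with a meaningless ratio.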

First, the movies with the highest budget in the dataset:

As you might expect, this is basically a list of recent blockbusters. These are the ones that the studios are happy to throw buckets of cash at in the hopes that it pays off. It mostly does: with the exception of The Lone Ranger, all of these films made at least twice their budget in revenue.

Next, let’s run the same chart, but sorted by the revenue column:

Still blockbusters, but it’s interesting that it is a very different list of blockbusters. In fact, only five films are in both the 20 highest budgets and the 20 highest revenues. Here are the 20 highest-revenue films alongside their budget ranks:

original_title                                 budget_rank  revenue_rank
Avatar                                         22           1
Star Wars: The Force Awakens                   20           2
Titanic                                        33           3
The Avengers                                   27           4
Jurassic World                                 124          5
Furious 7                                      59           6
Avengers: Age of Ultron                        3            7
Harry Potter and the Deathly Hallows: Part 2   236          8
Frozen                                         124          9
Beauty and the Beast                           108          10
The Fate of the Furious                        10           11
Iron Man 3                                     33           12
Minions                                        644          13
Captain America: Civil War                     10           14
Transformers: Dark of the Moon                 57           15
The Lord of the Rings: The Return of the King  396          16
Skyfall                                        33           17
Transformers: Age of Extinction                29           18
The Dark Knight Rises                          10           19
Toy Story 3                                    33           20
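The repeated ranks in the table (three films share budget rank 10, for example) suggest that ties are given the smallest rank they span. Assuming that rule, here is a small Python sketch of ranking two columns and finding the films that land in the top N of both. The function name `min_rank_desc` and the four sample films are invented for illustration.

```python
def min_rank_desc(values):
    """Rank values from largest (rank 1) to smallest; ties share the smallest rank."""
    ordered = sorted(values, reverse=True)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # Only the first (best) position seen for a value is kept.
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]

# Made-up films as (title, budget, revenue); not figures from the dataset.
films = [("A", 200, 900), ("B", 250, 500), ("C", 250, 920), ("D", 50, 950)]

budget_rank = min_rank_desc([budget for _, budget, _ in films])
revenue_rank = min_rank_desc([revenue for _, _, revenue in films])

top_n = 2
in_both = [
    title
    for (title, _, _), b_rank, r_rank in zip(films, budget_rank, revenue_rank)
    if b_rank <= top_n and r_rank <= top_n
]
print(budget_rank, revenue_rank, in_both)
```

With this tie rule, films B and C both get budget rank 1 and the next film gets rank 3, which matches the pattern of gaps visible in the table.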

Finally for this section, we’ll run that chart again but sort on the ROI column.

So, The Blair Witch Project is a spectacular outlier here. It’s also worth noting that a fair number of the other films here (The Gallows, The Texas Chain Saw Massacre, Night of the Living Dead, Halloween, The Legend of Boggy Creek, and Blood Feast) are horror movies; it seems they’re cheap to make and sometimes become cult classics.

Let’s run the chart again, excluding The Blair Witch Project so we can focus on the others.

The top hit here (the film with the Chinese title) is The Way of the Dragon, a 1972 martial arts flick starring Bruce Lee.

Aside from the horror films I already mentioned, there are also a handful of early Disney animated movies on the list. Disney was well known for under-paying its staff in the early days, to the point where the animators went on strike in 1941; this may well have contributed to the low budgets of the films in question.

Genres

Now, let’s look at what genres of movies are in the dataset.

Nearly half of the roughly 45,000 movies in the dataset are tagged as “Drama”. Let’s take a look at some trends over time. Here’s a select group of genres, and the proportion of all films released in each decade that were tagged with that genre:

The chart above lets you spot the downfall of the Western movie: popular up to the 1950s, then a slow decline through to the 1980s, when it peters out to essentially nothing. Yet even at its peak, it didn’t hold a candle to the Dramas, Comedies, or Romances.
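The proportion behind that chart is simple: for each decade, count the films tagged with a genre and divide by all films released in that decade. Here is a rough Python sketch, assuming a lookup from movie ID to release year plus the long (id, genre) table described earlier. The function name and the sample data are illustrative only.

```python
from collections import Counter

def genre_share_by_decade(movie_years, genre_rows, genre):
    """Share of each decade's films that carry `genre`.

    movie_years: dict mapping movie id -> release year
    genre_rows:  iterable of (movie id, genre name) pairs
    """
    totals = Counter((year // 10) * 10 for year in movie_years.values())
    # A set guards against counting the same film twice for one genre.
    tagged_ids = {
        movie_id
        for movie_id, name in genre_rows
        if name == genre and movie_id in movie_years
    }
    tagged = Counter((movie_years[movie_id] // 10) * 10 for movie_id in tagged_ids)
    return {decade: tagged[decade] / totals[decade] for decade in sorted(totals)}

# Made-up sample: two films from the 1950s and two from the 1980s.
movie_years = {1: 1952, 2: 1958, 3: 1985, 4: 1989}
genre_rows = [(1, "Western"), (1, "Drama"), (2, "Drama"), (3, "Comedy"), (4, "Drama")]

western_share = genre_share_by_decade(movie_years, genre_rows, "Western")
drama_share = genre_share_by_decade(movie_years, genre_rows, "Drama")
print(western_share, drama_share)
```

Because a film can carry several genres, the shares for different genres within a decade can add up to more than 1.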

What about money by genre? Which types of films make the most money?

You can see there are a small number of outliers here. Let’s find out which films they are.

We’re back to the blockbusters, unsurprisingly: Avatar, Star Wars: The Force Awakens, Titanic, The Avengers, Jurassic World, and Furious 7. Let’s run the chart again without the movies that made over $1.5 billion and see what we get.

The genres in the chart are ordered by mean ROI. We can therefore see that, on average, Horror films give their financiers the best value for money, followed by Documentary and Mystery. These are all films that can be cheap to produce, though they may have only limited audience appeal, judging by the highest revenues in each genre.

Let’s run a chart showing the ROI values directly:

Ah, the outliers strike again. The film making back 4,000 times its budget is The Blair Witch Project, as we established earlier. Let’s zoom in on the chart a bit and focus on movies making fifty times their budget or less. Which is still a lot of money! Just not quite as out there.

Ah, so the zoomed-in chart shows that it’s at least partially that one movie pulling up the Horror and Mystery genres so high; if you examine the interquartile ranges of the boxplots, you’ll spot that Documentary has a higher 75th percentile than either. The returns are still pretty good for both Horror and Mystery overall, but neither is a guaranteed win or anything. Not that anything ever is in the movie business anyway.
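To see why a single extreme film can reorder genres by mean ROI without moving the quartiles, here is a small Python sketch using the standard `statistics` module. The ROI values are invented purely to show the effect, and the `"inclusive"` quantile method is one choice among several, so the exact quartiles may differ from what a boxplot in another tool would draw.

```python
from statistics import mean, quantiles

def summarise(roi_values):
    """Return the mean and the quartile cut points (Q1, median, Q3)."""
    q1, median, q3 = quantiles(roi_values, n=4, method="inclusive")
    return {"mean": mean(roi_values), "q1": q1, "median": median, "q3": q3}

# Made-up ROI values: the first group contains one extreme outlier.
horror = [1, 2, 3, 4, 4000]
documentary = [2, 4, 6, 8, 10]

horror_summary = summarise(horror)
documentary_summary = summarise(documentary)
print(horror_summary)
print(documentary_summary)
```

In this toy example the first group wins comfortably on the mean, yet the second group has the higher 75th percentile, which is the same pattern the zoomed-in boxplots reveal.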

Genre combinations

I have mentioned repeatedly that the same film can be tagged with multiple genres. Let’s look at the relationship between them. Which genres appear together most often?

By far the most common combinations in the dataset are Drama/Romance and Drama/Comedy. The next most frequent are Drama/Thriller and Comedy/Romance, but after that it falls off a cliff fairly rapidly. These frequencies map fairly closely to the overall proportions, as demonstrated in the next chart.
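Counting combinations like these comes down to grouping genres by movie and tallying every unordered pair. Here is a minimal Python sketch, assuming the long (id, genre) table from earlier; the function name `genre_pair_counts` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def genre_pair_counts(genre_rows):
    """Count how often each unordered pair of genres is tagged on the same movie."""
    genres_by_movie = defaultdict(set)
    for movie_id, genre in genre_rows:
        genres_by_movie[movie_id].add(genre)

    pairs = Counter()
    for genres in genres_by_movie.values():
        # Sorting first means Drama/Romance and Romance/Drama are the same key.
        pairs.update(combinations(sorted(genres), 2))
    return pairs

# Made-up sample rows in the same shape as the genre table.
genre_rows = [
    (1, "Drama"), (1, "Romance"),
    (2, "Drama"), (2, "Romance"), (2, "Comedy"),
    (3, "Comedy"),
]

pair_counts = genre_pair_counts(genre_rows)
print(pair_counts.most_common(3))
```

A movie with a single genre contributes no pairs, so the pair counts only describe films tagged with two or more genres.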

Conclusion

So, after all that, what have we learned?