I have spent the last week or so exploring this dataset of movie information, which was extracted from The Movie Database (TMDB).
Let’s start by taking a look at the structure of the main dataset:
## Observations: 45,466
## Variables: 25
## $ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ belongs_to_collection <chr> "{'id': 10194, 'name': 'Toy Story Collec...
## $ budget <dbl> 30000000, 65000000, 0, 16000000, 0, 6000...
## $ genres <chr> "[{'id': 16, 'name': 'Animation'}, {'id'...
## $ homepage <chr> "http://toystory.disney.com/toy-story", ...
## $ id <dbl> 862, 8844, 15602, 31357, 11862, 949, 118...
## $ imdb_id <chr> "tt0114709", "tt0113497", "tt0113228", "...
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en"...
## $ original_title <chr> "Toy Story", "Jumanji", "Grumpier Old Me...
## $ overview <chr> "Led by Woody, Andy's toys live happily ...
## $ popularity <dbl> 21.946943, 17.015539, 11.712900, 3.85949...
## $ poster_path <chr> "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg", "/vz...
## $ production_companies <chr> "[{'name': 'Pixar Animation Studios', 'i...
## $ production_countries <chr> "[{'iso_3166_1': 'US', 'name': 'United S...
## $ release_date <date> 1995-10-30, 1995-12-15, 1995-12-22, 199...
## $ revenue <dbl> 373554033, 262797249, 0, 81452156, 76578...
## $ runtime <dbl> 81, 104, 101, 127, 106, 170, 127, 97, 10...
## $ spoken_languages <chr> "[{'iso_639_1': 'en', 'name': 'English'}...
## $ status <chr> "Released", "Released", "Released", "Rel...
## $ tagline <chr> NA, "Roll the dice and unleash the excit...
## $ title <chr> "Toy Story", "Jumanji", "Grumpier Old Me...
## $ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ vote_average <dbl> 7.7, 6.9, 6.5, 6.1, 5.7, 7.7, 6.2, 5.4, ...
## $ vote_count <dbl> 5415, 2413, 92, 34, 173, 1886, 141, 45, ...
## $ year <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995...
One thing to note is that several fields in the dataset contain JSON-style data, which lets the creator capture multiple values within one CSV field. (It’s actually malformed JSON, and that was a headache and a half to deal with; I got the ‘genres’ field reading successfully, but eventually gave up on parsing any of the others.) Here’s a quick snapshot of the genre data table I extracted:
## Observations: 91,106
## Variables: 2
## $ id <dbl> 862, 862, 862, 8844, 8844, 8844, 15602, 15602, 31357, 31...
## $ genre <chr> "Animation", "Comedy", "Family", "Adventure", "Fantasy",...
The id value is the movie ID from the main table; each movie ID appears once for every genre the movie is tagged with. A movie with no genres does not appear in the table at all.
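The single quotes in these fields are what make them invalid JSON; they look more like printed Python dictionaries. Assuming that holds for the genres column, here is a minimal Python sketch of the flattening step. It is not the code behind this post, and the `parse_genres` helper and the sample rows are made up for illustration.

```python
import ast

def parse_genres(movie_id, raw):
    """Turn one stringified list of genre dicts into (id, genre) rows."""
    if not raw:
        return []
    try:
        # literal_eval accepts single-quoted Python literals, unlike a JSON parser.
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # unparseable field: skip it rather than guess
    if not isinstance(items, list):
        return []
    return [(movie_id, item["name"]) for item in items
            if isinstance(item, dict) and "name" in item]

# Hypothetical sample rows in the same shape as the genres column.
rows = [
    (862, "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, "
          "{'id': 10751, 'name': 'Family'}]"),
    (99999, "[]"),  # made-up movie with no genres
]

genre_table = [pair for movie_id, raw in rows
               for pair in parse_genres(movie_id, raw)]
print(genre_table)
```

A movie with an empty genre list contributes no rows, which matches the behaviour described above.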
The first thing I wanted to investigate was the spread of the data.
Let’s start with years.
So, it looks like the data is fairly minimal in the early 1900s, gradually becoming more comprehensive as you get closer to the present day.
It’s also worth noting that the data doesn’t just include released movies; there are also ones that are in post-production, cancelled, or even just rumoured. Let’s check the volumes of those.
So, it looks like there aren’t many movies in any status other than “released” in the dataset, and the few that are there should probably be ignored, as their data is likely to be incomplete.
TMDB is an English-language website, so I thought I should check whether the data contained movies in other languages at all, and if so, how many.
As is to be expected, the dataset is heavily biased towards English releases. There are a small number of films in other languages, but the primary focus is on movies of interest to English speakers. This is an important caveat to note on the limitations of the dataset.
Remakes and adaptations are kind of the bane of the movie-goer’s existence these days; every film at the box office seems to be an adaptation, mostly of the same classic novels that were being adapted twenty years ago. Let’s take a look at which titles appear most often in the dataset, and how many different languages those films were made in.
As is to be expected, the list is basically an English major’s assigned reading list: lots of classic English literature, but with an eye to classic children’s literature as well, such as Heidi. The handful of exceptions are more generic titles that are presumably coincidental rather than remakes: “Home”, “Blackout”, “Eden”. Given the strong English bias in the dataset, it’s unsurprising that most of these only appear in one language; but it’s also unsurprising that stories like Les Mis, originally written in French, have been made in more than one.
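The count behind that list is simple enough to sketch. This is a rough Python version, assuming each movie is reduced to a (title, original_language) pair; the sample pairs below are invented and not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (title, original_language) pairs; not real rows from the data.
movies = [
    ("Les Misérables", "fr"), ("Les Misérables", "en"), ("Les Misérables", "en"),
    ("Heidi", "en"), ("Heidi", "en"),
    ("Home", "en"),
]

counts = defaultdict(int)      # how many films share each title
languages = defaultdict(set)   # which languages each title was made in
for title, lang in movies:
    counts[title] += 1
    languages[title].add(lang)

# Most frequent titles first; ties broken alphabetically.
summary = sorted(
    ((title, counts[title], len(languages[title])) for title in counts),
    key=lambda row: (-row[1], row[0]),
)
for title, n_films, n_languages in summary:
    print(title, n_films, n_languages)
```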
What types of movies are the biggest successes? Let’s take a look at the ones that make back many times their budget in revenue. For the purposes of the following charts, Return on Investment (ROI) is calculated as revenue / budget; obviously these numbers may not include additional costs such as advertising spend. The revenue field appears to be worldwide box office takings rather than US-only: several films in the data are recorded as making over $1.5 billion.
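As a quick worked example of that ROI definition, here is a small Python sketch using the first few budget and revenue values from the structure listing at the top of the post. Treating a zero budget or revenue as “unknown” is my assumption here, not something the dataset documents.

```python
def roi(revenue, budget):
    """Return revenue / budget, or None when either figure is zero or missing."""
    # Assumption: a zero in either column means "unknown", not "free to make".
    if not budget or not revenue:
        return None
    return revenue / budget

# First four rows of the main table: (title, budget, revenue).
sample = [
    ("Toy Story", 30000000, 373554033),
    ("Jumanji", 65000000, 262797249),
    ("Grumpier Old Men", 0, 0),
    ("(fourth row)", 16000000, 81452156),
]

for title, budget, revenue in sample:
    value = roi(revenue, budget)
    print(title, "n/a" if value is None else round(value, 2))
```

Dropping the zero rows before charting avoids both division by zero and misleading ROI values of zero.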
First, the movies with the highest budget in the dataset:
As you might expect, this is basically a list of recent blockbusters. These are the ones that the studios are happy to throw buckets of cash at in the hopes that it pays off. It mostly does: with the exception of The Lone Ranger, all of these films made at least twice their budget in revenue.
Next, let’s run the same chart, but sorted by the revenue column:
Still blockbusters, but it’s interesting that it is a very different list of blockbusters. In fact, only five films are in both the 20 highest budgets and the 20 highest revenues. Here are the top 20 by revenue, with each film’s budget rank alongside:
| original_title | budget_rank | revenue_rank |
|---|---|---|
| Avatar | 22 | 1 |
| Star Wars: The Force Awakens | 20 | 2 |
| Titanic | 33 | 3 |
| The Avengers | 27 | 4 |
| Jurassic World | 124 | 5 |
| Furious 7 | 59 | 6 |
| Avengers: Age of Ultron | 3 | 7 |
| Harry Potter and the Deathly Hallows: Part 2 | 236 | 8 |
| Frozen | 124 | 9 |
| Beauty and the Beast | 108 | 10 |
| The Fate of the Furious | 10 | 11 |
| Iron Man 3 | 33 | 12 |
| Minions | 644 | 13 |
| Captain America: Civil War | 10 | 14 |
| Transformers: Dark of the Moon | 57 | 15 |
| The Lord of the Rings: The Return of the King | 396 | 16 |
| Skyfall | 33 | 17 |
| Transformers: Age of Extinction | 29 | 18 |
| The Dark Knight Rises | 10 | 19 |
| Toy Story 3 | 33 | 20 |
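The repeated budget ranks in the table (three films at 10, four at 33) suggest that tied budgets share the lowest available rank. Assuming that scheme, a minimal Python sketch of the ranking looks like this; the `min_rank_desc` name and the budget figures are hypothetical.

```python
def min_rank_desc(values):
    """Rank values largest-first; tied values all get the lowest rank of the tie."""
    ordered = sorted(values, reverse=True)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        # setdefault keeps the first (best) position seen for each value.
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]

# Hypothetical budgets in millions of dollars.
budgets = [250, 300, 250, 100]
ranks = min_rank_desc(budgets)
print(ranks)  # [2, 1, 2, 4]
```

Note that the rank after a tie skips ahead (there is no rank 3 above), which is why the table can jump from 10 straight to 20.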
Finally for this section, we’ll run that chart again but sort on the ROI column.
So, The Blair Witch Project is a spectacular outlier here. It’s also worth noting that a fair number of the other films (The Gallows, The Texas Chain Saw Massacre, Night of the Living Dead, Halloween, The Legend of Boggy Creek, and Blood Feast) are horror movies; it seems they’re cheap to make and sometimes become cult classics.
Let’s run the chart again, excluding The Blair Witch Project so we can focus on the others.
The top hit here (the film with the Chinese title) is The Way of the Dragon, a 1972 martial arts flick starring Bruce Lee.
Aside from the horror films I already mentioned, there are also a handful of early Disney animated movies on the list. Disney was well known for underpaying its staff in the early days, to the point where the animators went on strike in 1941, and this may well have contributed to the low budgets of the films in question.
Now, let’s look at what genres of movies are in the dataset.
Nearly half of the roughly 45,000 movies in the dataset are tagged as “Drama”. Let’s take a look at some trends over time. Here’s a select group of genres, and the proportion of all films released in each decade that were tagged with each genre:
The chart above lets you spot the downfall of the Western: popular up to the 1950s, then a slow decline through to the 1980s, when it peters out to essentially nothing. Yet even at its peak, it didn’t hold a candle to the Dramas, Comedies, or Romances.
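For reference, the per-decade proportion can be sketched as below. This is an illustrative Python version built on invented data. It assumes the denominator is every film released in the decade, so the shares for different genres can add up to more than 1 when films carry several tags.

```python
from collections import defaultdict

def decade(year):
    """Map a year to the start of its decade, e.g. 1955 -> 1950."""
    return year // 10 * 10

# Hypothetical data: movie id -> release year, plus (id, genre) rows.
movie_years = {1: 1952, 2: 1955, 3: 1958, 4: 1985}
genre_rows = [(1, "Western"), (1, "Drama"), (2, "Drama"),
              (3, "Western"), (4, "Drama")]

# Denominator: all films released in each decade.
films_per_decade = defaultdict(set)
for movie_id, year in movie_years.items():
    films_per_decade[decade(year)].add(movie_id)

# Numerator: films in each decade tagged with each genre.
tagged = defaultdict(set)
for movie_id, genre in genre_rows:
    tagged[(decade(movie_years[movie_id]), genre)].add(movie_id)

share = {key: len(ids) / len(films_per_decade[key[0]])
         for key, ids in tagged.items()}
for (dec, genre), value in sorted(share.items()):
    print(dec, genre, round(value, 2))
```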
What about money by genre? Which types of films make the most money?
You can see there are a small number of outliers here. Let’s find out which films they are.
We’re back to the blockbusters, unsurprisingly: Avatar, Star Wars: The Force Awakens, Titanic, The Avengers, Jurassic World, and Furious 7. Let’s run the chart again without the movies that made over $1.5 billion and see what we get.
The genres in the chart are ordered by mean ROI. We can therefore see that, on average, Horror films give their financiers the best value for money, followed by Documentary and Mystery. These are all films that can be cheap to produce, though they may have only limited audience appeal, judging by the top revenues in each.
Let’s run a chart showing the ROI values directly:
Ah, the outliers strike again. The film making back 4000 times its budget is The Blair Witch Project, as we established earlier. Let’s zoom in on the chart a bit and focus on movies making fifty times their budget or less. Which is still a lot of money! Just not quite as out there.
Ah, so the zoomed-in chart shows that it’s at least partially that one movie pulling the Horror and Mystery genres up so high; if you examine the interquartile ranges of the boxplots, you’ll spot that Documentary has a higher 75th percentile than either. It’s still pretty good for both overall, but it isn’t a guaranteed win or anything. Not that anything ever is in the movie business anyway.
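To show how one outlier can drag a mean upwards while leaving the quartiles alone, here is a small Python sketch. The ROI values are made up to show the effect. The quartile method used here (“inclusive”) is also just one convention, and it may not match exactly what a boxplot draws.

```python
import statistics

# Hypothetical ROI values per genre, chosen so one outlier dominates Horror.
roi_by_genre = {
    "Horror": [1, 2, 3, 4, 100],
    "Documentary": [2, 3, 4, 5, 6],
}

summary = {}
for genre, values in roi_by_genre.items():
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    summary[genre] = {
        "mean": statistics.mean(values),
        "q1": q1, "median": median, "q3": q3,
    }

for genre, stats in summary.items():
    print(genre, stats)
```

In this toy data Horror has by far the higher mean, yet Documentary has the higher 75th percentile, which is the same pattern as in the zoomed-in chart.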
I have mentioned several times that the same film can be tagged with multiple genres. Let’s look at the relationship between them. Which genres appear together most often?
By far the most common combinations in the dataset are Drama/Romance and Drama/Comedy. The next most frequent are Drama/Thriller and Comedy/Romance, but after that it falls off a cliff fairly rapidly. These frequencies map fairly closely to the overall proportions, as demonstrated in the next chart.
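Counting genre pairs comes down to grouping the genre rows by movie and then counting every unordered pair within each group. Here is a minimal Python sketch using invented (id, genre) rows in the same shape as the genre table.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (movie id, genre) rows.
genre_rows = [
    (1, "Drama"), (1, "Romance"),
    (2, "Drama"), (2, "Comedy"), (2, "Romance"),
    (3, "Drama"),
]

genres_by_movie = defaultdict(set)
for movie_id, genre in genre_rows:
    genres_by_movie[movie_id].add(genre)

pair_counts = Counter()
for genres in genres_by_movie.values():
    # Sorting first means Drama/Romance and Romance/Drama count as the same pair.
    pair_counts.update(combinations(sorted(genres), 2))

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Movies with a single genre contribute no pairs, so they only show up in the overall genre proportions.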
So, after all that, what have we learned?