Analyzing Movie Scores on IMDb and Rotten Tomatoes

Xinghui Song

Motivation

While it is fun to just go to a movie theatre and hope to be pleasantly surprised by a new release, many people prefer to go to movie review sites for recommendations. IMDb and Rotten Tomatoes are arguably two of the most popular websites where people can find reviews and ratings for nearly any movie. Both of these websites provide two sets of ratings: critics scores and audience scores. IMDb uses Metascore to represent critics opinions while Rotten Tomatoes uses Tomatometer to achieve the same goal. For a single movie, one can get four different scores just within these two websites. Therefore it can be confusing to some viewers as to which one to trust, or how to interpret the potentially large differences in scores.

This project aims to obtain insights regarding the interpretation and comparison among these different scoring systems through data analysis and visualization. One focus in this project would be to visualize the distributions of scores and see how they vary across the years and across different genres. The former can give us a temporal view of the movie qualities according to critics and viewers, while the latter will shed some light on whether or not particular genres tend to be favored by critics and/or viewers. This can give viewers a better understanding of where a score stands comparing to other movies in the same time or same genre, which is more informative than just looking at the sheer value of a score. Another objective of this project is to find out what factors into a movie getting vastly different scores from different sources. For example, Batman v Superman received a notoriously low Tomatometer score of 27%, which is almost 40 points than its audience score. It could be interesting to determine if this happens more often to movies with certain themes, or maybe even to certain directors. Finally, the project will also presents some rankings of actors and directors by averaging the scores of movies they appear in.

Data Sources

IMDb + Rotten Tomatoes Datasets from OCDF

The data can be found and downloaded from the Open Collaboration Data Factories website. There are 12 data tables in total, but not all of them will be used in this project. The datasets are a derivative of the MovieLens 10M dataset published by GroupLens research group, with some additional columns added through web scraping. Below is a more detailed description of the tables used in this project:

  • movies: The main table, containing all ratings and review counts data from Rotten Tomatoes. There is an id column that links the main table to rest of the tables. Also there is an imdbID column that can be used to query IMDb ratings using API.
  • movie_genres: This table links movie IDs to one or more genres from 20 different genres. Each movie averagely has two genres.
  • movie_actors: Links movie IDs to actors appeared in those movies. Each actor is associated with a ranking variable, which is based on the order actors are listed on the movie’s IMDb page. Also each actor is associated with a unique id to avoid name confusions.
  • movie_directors: Links movie IDs to directors. Like the actors table, each director has a unique id.

These tables cover information of 10197 movies from year 1903 to 2011 (Note: There are actually duplicated records in the original dataset. After removing them, the number of movies goes down to 9413.). However, years towards both ends of the time frame have much fewer movies than years in the middle. For the purpose of having a reasonable volume of movies for each year, we eliminate the years with fewer than 10 movies. As a result, the time frame is narrowed to from year 1930 to 2009, with 9357 movies in total.

OMDB API

The OMDB API is a RESTful web service to obtain movie information. It can be used to extract a wide range of variables just by providing a valid IMDb ID (for movie titles, it takes the format: tt + a seven-digit numeric id). The url request takes the form of “http://www.omdbapi.com/?apikey=xxxxxxxx&i=tt0031381”, where “tt0031381” is the IMDb ID of the movie Gone with the Wind. The requests return json strings that can be loaded into python and added as additional columns to the main table movies. In this case, most data we need are already covered by the OCDF dataset. We only need the OMDB API to obtain the Metascores and IMDb user scores. The API requests are made possible thanks to fact that the IMDb ID of each movie is readily available in the main table. By using OMDB API, we pulled the Metascore and IMDb user score for each of the 9357 movies and stored the data as new columns Metascore and imdbRating.

Data Manipulation Methods

In this section, we will go through the data manipulation I’ve done to prepare and process the data. Some of these steps are already mentioned in Data Sources section, and they will be explained in more details here.

General Workflow

As mentioned in the previous section, the original data is stored in DAT format. The read.delim function in R can import DAT files into R dataframes with ease. Before the data is read into R, we first use Vim global command to delete some incorrectly written actor data to ensure there won’t be formatting errors when exporting the data to TSV or JSON files. After importing the downloaded data files in R, we cleared out duplicated movie records using dplyr package. The resulting R dataframes were then written as TSV files and imported into python. In python, we added columns Metascore and imdbRating to the main table movies using pandas module, requests module and OMDB API. After the data cleaning, the data stored in pandas DataFrames were exported as JSON files through the to_json method provided by pandas. The JSON files are formatted as one record per line so that pyspark can directly read them into DataFrames. All data manipulation for further analysis are done in SparkSQL. Major steps in this process will be explained in more details in following sections.

Cleaning Data

The data downloaded from OCDF has already undergone some cleaning and reorganization so the data is almost ready to use. Though there are still a few things that need to be tidied up. The data cleaning relies on usage of both R and python. As mentioned in the previous section, the original data is stored in DAT format. The read.delim function in R can import DAT files into R dataframes with ease.

Removing messy actor names

The data posted on OCDF has a few additional files than the original MovieLens data. One of them is movie_actors. The lists of actors in that table is rather complete. Some movies have over 50 actors associated to their IDs. Although some actors’ names are not stored in full in movie_actors table. All of these actors have names like John “Having a Long Nickname” Doe. Values of such names in the DAT file are incomplete strings and often end with ellipses that can easily hinder data importation. In order to eliminate these records and avoid getting errors when importing the DAT files, we deleted all lines containing these type of actor names using Vim command :g/\.\.\./d. Only 389 out of 231742 lines were deleted. Since the deleted actors only take up a small portion of the entire table and they usually have very low rankings (much larger than 20), this removal is not expected to have much of an impact on our analysis concerning actors.

Removing duplicated movies

One issue is that there are occasions where different movie IDs point to the same movie (as mentioned briefly in last section). Fortunately we still have the IMDb ID that can act as a unique identifier of a movie. This cleaning task can be done using dplyr. The dplyr package is a very powerful R package designed for data manipulation. In this step, we first use the group_by function to segment movies by their unique IMDb IDs, then use filter function to keep only the first row in each group. Package dplyr has a function row_number that serves as a hidden “column” for row indices, which is exactly what we need in this operation. The R command for this step looks like this:

df = df %>% group_by(imdbID) %>% filter(row_number() == 1)

Adding Data from API

This part is fairly straight-forward with the requests module. We only need to supply an API token and a valid IMDb ID to the url format to get the desired movie scores. The imdbID column in table movies is of numeric type, whereas a standard movie title ID takes the form “tt + a seven-digit integer”. Therefore in this step I used the format function to pad the imdbID strings with leading zeros to make them have a fixed width of 7. By calling requests.get with each unique IMDb ID, I was able to add the Metascore and IMDb user score to each movie in the cleaned table. Note that of all four scoring systems we are interested in, IMDb user score is the only one on a scale of 0-10. For the scores to be comparable, we multiplied IMDb user scores by 10 to put them on the same 0-100 scale as the other three scoring systems.

Joining Tables in Spark

As mentioned in Data Sources section, all the supporting tables like movie_actors can be joined to the main table by movieID column. Since we already removed duplicated movie records from the main table, there won’t be any redundant or misleading rows as long as we use INNER JOIN in the SQL commands. The main table is used in all analyses in this project, while other tables are joined when the need arises in each specific task. Because most of the analyses relies heavily on averaging and/or ranking review scores, we will leave the specific SQL operations to the next section.

Analysis and Visualization

Ranking Director/Actor Duos

It usually takes the combined effort of great directors and great actors to make extraordinary movies. In this analysis, the objective is to get rankings for director/actor duos based on their average movie scores.

First we need to construct a base table by joining tables movies, actors and directors. Because we mainly care about duos of director and leading actors, we add a condition to only use the rows in table actors that have a top 10 ranking. Here is the SQL command for joining the tables:

SELECT *
FROM movies AS m JOIN actors AS ma ON m.id = ma.movieID
JOIN directors AS md ON m.id = md.movieID
WHERE ma.ranking <= 10

The output of the above SQL query is then registered as table duo_base in SQLContext. Here we use Metascore as an example of getting the rankings. First we divide the rows of duo_base into groups, each of which is associated with a unique director/actor pair. We can use columns directorID and actorID to determine unique duos. Due to the fact that some movies don’t have Metascores (such movies have “N/A” in their Metascore field), we also need a WHERE clause to exclude these records when averaging scores. A duo have to work on at least three movies together to be considered on the lists. As a result, the SQL query for this task looks like this:

SELECT FIRST(directorName) AS directorName, FIRST(actorName) AS actorName,
AVG(Metascore) AS Metascore, COUNT(imdbID) AS count
FROM duo_base
WHERE Metascore != 'N/A'
GROUP BY directorID, actorID
HAVING count >= 3
ORDER BY Metascore DESC

In order to save some space, here we only display the top 20 duos based on Metascore and IMDb user score.

Top 20 director/actor duos based on Metascores
Director Actor Metascore No. of Movies
1 Francis Ford Coppola Robert Duvall 91.67 3
2 Peter Jackson Elijah Wood 91.33 3
3 Peter Jackson Viggo Mortensen 91.33 3
4 Peter Jackson Ian McKellen 91.33 3
5 Peter Jackson Sean Astin 91.33 3
6 Peter Jackson Billy Boyd 91.33 3
7 Francis Ford Coppola John Cazale 90.33 3
8 Jean-Luc Godard Jean-Luc Godard 87.33 3
9 Jean-Luc Godard Anna Karina 85.33 3
10 Martin Scorsese Joe Pesci 84.67 3
11 Martin Scorsese Robert De Niro 84.00 5
12 Kar Wai Wong Maggie Cheung 83.33 3
13 Jean-Pierre Dardenne Olivier Gourmet 82.75 4
14 Guillermo del Toro Doug Jones 82.67 3
15 Kar Wai Wong Tony Leung Chiu Wai 81.75 4
16 Francis Ford Coppola Diane Keaton 81.67 3
17 Francis Ford Coppola Al Pacino 81.67 3
18 Francis Ford Coppola Talia Shire 81.67 3
19 Woody Allen Diane Keaton 81.60 5
20 Steven Soderbergh Luis Guzman 81.33 3
Top 20 director/actor duos based on IMDb scores
Director Actor IMDb score No. of Movies
1 Francis Ford Coppola Robert Duvall 89.00 3
2 Peter Jackson Elijah Wood 88.00 3
3 Peter Jackson Viggo Mortensen 88.00 3
4 Peter Jackson Ian McKellen 88.00 3
5 Peter Jackson Sean Astin 88.00 3
6 Peter Jackson Billy Boyd 88.00 3
7 Francis Ford Coppola John Cazale 87.00 3
8 Christopher Nolan Christian Bale 86.00 3
9 Francis Ford Coppola Diane Keaton 86.00 3
10 Francis Ford Coppola Al Pacino 86.00 3
11 Francis Ford Coppola Talia Shire 86.00 3
12 Masaki Kobayashi Michiyo Aratama 85.00 4
13 Sergio Leone Clint Eastwood 84.00 3
14 Martin Scorsese Joe Pesci 83.67 3
15 Masaki Kobayashi Tatsuya Nakadai 83.43 7
16 Charles Chaplin Charles Chaplin 83.00 4
17 Kenji Mizoguchi Kikue Mori 82.67 3
18 Michael Apted Symon Basterfield 82.50 4
19 Michael Apted Bruce Balden 82.50 4
20 Michael Apted Jacqueline Bassett 82.50 4

We can see plenty of familiar names on both lists. Francis Ford Coppola and Robert Duvall appear at the top of both lists. Francis Ford Coppola even have five entries on both lists with the same five actors. This is largely due to the success of the Godfather franchise. The same goes for Peter Jackson. Because of the critically acclaimed Lord of the Rings trilogy, He has five entries on both lists as well and all of them are from Lord of the Rings trilogy. Besides these two directors, there are some other famous duos on the lists as well, for example, Martin Scorsese/Robert De Niro and Woody Allen/Diane Keaton. Another interesting duo is Charles Chaplin/Charles Chaplin. Even though this is not technically a duo but I decided to leave it in the list because it is quite a feat to have directed and acted in four movies averaging 83 on IMDb.

Finding Movies with Polarizing Scores

One of the first thing that piqued my interest in this project is that the movie Twin Peaks: Fire Walk with Me received a 72 on IMDb score but only got a 28 on Metascore. Even though the IMDb user score of a movie tends to be higher than its Metascore, a score difference of 44 is still astounding. Therefore in this section we will look into movies with a huge gap between their IMDb scores and Metascores. Note that we are only interested in movies with a much higher IMDb score, as a movie getting a much higher Metascore rarely happens.

Movies with the largest score differences

For the first step, we will simply compute the score difference and show movies with the 20 largest score differences. This only needs a straight-forward SQL query on the movies table and subtract column Metascore from column imdbRating.

Top 20 largest score differences
Title IMDb score Metascore Deference
1 Half Baked 67 16 51
2 Screwed 57 7 50
3 La tigre e la neve 72 22 50
4 Billy Madison 64 16 48
5 Highlander 72 24 48
6 I Am Sam 76 28 48
7 Kung Pow: Enter the Fist 62 14 48
8 Vulgar 53 5 48
9 Gummo 67 19 48
10 Tropa de Elite 81 33 48
11 The Butterfly Effect 77 30 47
12 Friday the 13th 65 19 46
13 8MM 65 19 46
14 Cocktail 58 12 46
15 The Life of David Gale 76 31 45
16 Somewhere in Time 73 29 44
17 Twin Peaks: Fire Walk with Me 72 28 44
18 The Beastmaster 62 18 44
19 The Keep 58 14 44
20 Bio-Dome 44 1 43

Many of the movies on Table 3 have an IMDb score over 70, which is usually a high enough score for a movie to be enjoyable. But they all get extremely low Metascores. Recall that Metascore is the weighted average of individual critic ratings. So a metascore as low as the ones in Table 2 means most critics find the movie horrible. In the next step we will dig a little deeper to find out what genres constantly receive polarizing scores.

Polarizing movies by genre

Assuming the table used by the last task is saved as gap, then we can join tables gap and genres by movieID. Here we arbitrarily set the threshold of score difference at 30, as we are mostly interested to see what genres tend to have exceedingly large score gaps. Below is the SQL command to count the number of movies having a score gap larger than 30 in each genre:

SELECT genres.genre AS Genre, COUNT(gap.id) AS Count
FROM gap JOIN genres ON gap.id = genres.movieID
WHERE gap.gap >= 30
GROUP BY genre
ORDER BY Count DESC

Figure 3 is a bar chart that illustrates the results.

Bar chart of movies count by genre

In Figure 3, the labels on the right hand side indicate how many movies in each genre have a score gap larger than 30. The labels on the left hand side are the percentages of movies that have a score gap larger than 30 in each genre. Movies without either an IMDb score or a Metascore are excluded from the denominators when computing these percentages.

From Figure 3, we can see that Comedy has a lot more polarizing movies than other genres. In fact, there are 183 movies that have a score difference larger than 30 (a movie can have multiple genres). This means sixty percent of those movies have comedy elements. This is not so far off from intuition. General audience usually judge a comedy movie by its ability to make people laugh, while critics consider far more standards than just decide whether or not the jokes are good.

Even though Drama is a close second to Comedy when looking at just the movie count, this doesn’t mean Drama films are prone to getting polarizing reviews as half the movies can be categorized as Drama. In order to investigate which genres are more likely to receive polarizing ratings, it is useful to look at the percentage of polarizing movies over the number of movies in each genre. According to the percentages, Crime, Comedy, Thriller, Action and Horror have a much higher chance than other genres to get polarizing scores. Similar to the discussion in section 4.1, these genres are usually entertaining for general audience to watch, but can have a harder time at getting praise from critics.

Challenges

The main challenge in this project was the data cleaning process. There were a few inconspicuous data entry errors that didn’t cause any script to break but silently altered the analysis results. For example, some actors’ names were written in very messy format and rendered multiple following rows unable to be read into data frames. Similarly, duplicated movie records initially caused many problems in ranking director/actor duos. Both of these issues were only discovered after I closely inspected the outputs and tried to get verification by manually searching movie information through both websites and API. Eventually they were handled using Vim and dplyr package respectively, as already explained in Data Manipulation Methods section.

Another major issue with the data is that the accuracy of Rotten Tomatoes data is questionable. The OCDF dataset was compiled in 2011, so the Rotten Tomatoes data was probably scraped during that time (Rotten Tomatoes data was not included in the original MovieLens dataset). A closer inspection at the data reveals that many movies have a very limited number of Rotten Tomatoes critics reviews. Therefore some movies have drastically different Tomatometer scores than they do today. For example, in our dataset, Batman & Robin only has one positive critics review and therefore has a perfect Tomatometer score at 100. However, its score on today’s Rotten Tomatoes website is merely 10 and it’s considered one of the worst movies of all time. There is no way to verify whether the data is completely false or it is simply because there weren’t enough reviews back then. Unfortunately, I couldn’t find a publicly available API to query up-to-date Rotten Tomatoes data for each movie. So eventually the analyses were carried out using the potentially faulty Rotten Tomatoes data. Furthermore, the last two analyses were performed on IMDb scores and Metascores only due to those analyses being more sensitive towards extreme scores, which happens more often in Rotten Tomato scoring system than IMDb scoring system.