Analyzing Movie Scores on IMDb and Rotten Tomatoes

Xinghui Song

Motivation

While it is fun to just go to a movie theatre and hope to be pleasantly surprised by a new release, many people prefer to go to movie review sites for recommendations. IMDb and Rotten Tomatoes are arguably two of the most popular websites where people can find reviews and ratings for nearly any movie. Both of these websites provide two sets of ratings: critics scores and audience scores. IMDb uses Metascore to represent critics opinions while Rotten Tomatoes uses Tomatometer to achieve the same goal. For a single movie, one can get four different scores just within these two websites. Therefore it can be confusing to some viewers as to which one to trust, or how to interpret the potentially large differences in scores.

This project aims to obtain insights regarding the interpretation and comparison among these different scoring systems through data analysis and visualization. One focus in this project would be to visualize the distributions of scores and see how they vary across the years and across different genres. The former can give us a temporal view of the movie qualities according to critics and viewers, while the latter will shed some light on whether or not particular genres tend to be favored by critics and/or viewers. This can give viewers a better understanding of where a score stands comparing to other movies in the same time or same genre, which is more informative than just looking at the sheer value of a score. Another objective of this project is to find out what factors into a movie getting vastly different scores from different sources. For example, Batman v Superman received a notoriously low Tomatometer score of 27%, which is almost 40 points than its audience score. It could be interesting to determine if this happens more often to movies with certain themes, or maybe even to certain directors. Finally, the project will also presents some rankings of actors and directors by averaging the scores of movies they appear in.

Data Sources

IMDb + Rotten Tomatoes Datasets from OCDF

The data can be found and downloaded from the Open Collaboration Data Factories website. There are 12 data tables in total, but not all of them will be used in this project. The datasets are a derivative of the MovieLens 10M dataset published by GroupLens research group, with some additional columns added through web scraping. Below is a more detailed description of the tables used in this project:

  • movies: The main table, containing all ratings and review counts data from Rotten Tomatoes. There is an id column that links the main table to rest of the tables. Also there is an imdbID column that can be used to query IMDb ratings using API.
  • movie_genres: This table links movie IDs to one or more genres from 20 different genres. Each movie averagely has two genres.
  • movie_actors: Links movie IDs to actors appeared in those movies. Each actor is associated with a ranking variable, which is based on the order actors are listed on the movie’s IMDb page. Also each actor is associated with a unique id to avoid name confusions.
  • movie_directors: Links movie IDs to directors. Like the actors table, each director has a unique id.

These tables cover information of 10197 movies from year 1903 to 2011 (Note: There are actually duplicated records in the original dataset. After removing them, the number of movies goes down to 9413.). However, years towards both ends of the time frame have much fewer movies than years in the middle. For the purpose of having a reasonable volume of movies for each year, we eliminate the years with fewer than 10 movies. As a result, the time frame is narrowed to from year 1930 to 2009, with 9357 movies in total.

OMDB API

The OMDB API is a RESTful web service to obtain movie information. It can be used to extract a wide range of variables just by providing a valid IMDb ID (for movie titles, it takes the format: tt + a seven-digit numeric id). The url request takes the form of “http://www.omdbapi.com/?apikey=xxxxxxxx&i=tt0031381”, where “tt0031381” is the IMDb ID of the movie Gone with the Wind. The requests return json strings that can be loaded into python and added as additional columns to the main table movies. In this case, most data we need are already covered by the OCDF dataset. We only need the OMDB API to obtain the Metascores and IMDb user scores. The API requests are made possible thanks to fact that the IMDb ID of each movie is readily available in the main table. By using OMDB API, we pulled the Metascore and IMDb user score for each of the 9357 movies and stored the data as new columns Metascore and imdbRating.

Data Manipulation Methods

In this section, we will go through the data manipulation I’ve done to prepare and process the data. Some of these steps are already mentioned in Data Sources section, and they will be explained in more details here.

General Workflow

As mentioned in the previous section, the original data is stored in DAT format. The read.delim function in R can import DAT files into R dataframes with ease. Before the data is read into R, we first use Vim global command to delete some incorrectly written actor data to ensure there won’t be formatting errors when exporting the data to TSV or JSON files. After importing the downloaded data files in R, we cleared out duplicated movie records using dplyr package. The resulting R dataframes were then written as TSV files and imported into python. In python, we added columns Metascore and imdbRating to the main table movies using pandas module, requests module and OMDB API. After the data cleaning, the data stored in pandas DataFrames were exported as JSON files through the to_json method provided by pandas. The JSON files are formatted as one record per line so that pyspark can directly read them into DataFrames. All data manipulation for further analysis are done in SparkSQL. Major steps in this process will be explained in more details in following sections.

Cleaning Data

The data downloaded from OCDF has already undergone some cleaning and reorganization so the data is almost ready to use. Though there are still a few things that need to be tidied up. The data cleaning relies on usage of both R and python. As mentioned in the previous section, the original data is stored in DAT format. The read.delim function in R can import DAT files into R dataframes with ease.

Removing messy actor names

The data posted on OCDF has a few additional files than the original MovieLens data. One of them is movie_actors. The lists of actors in that table is rather complete. Some movies have over 50 actors associated to their IDs. Although some actors’ names are not stored in full in movie_actors table. All of these actors have names like John “Having a Long Nickname” Doe. Values of such names in the DAT file are incomplete strings and often end with ellipses that can easily hinder data importation. In order to eliminate these records and avoid getting errors when importing the DAT files, we deleted all lines containing these type of actor names using Vim command :g/\.\.\./d. Only 389 out of 231742 lines were deleted. Since the deleted actors only take up a small portion of the entire table and they usually have very low rankings (much larger than 20), this removal is not expected to have much of an impact on our analysis concerning actors.

Removing duplicated movies

One issue is that there are occasions where different movie IDs point to the same movie (as mentioned briefly in last section). Fortunately we still have the IMDb ID that can act as a unique identifier of a movie. This cleaning task can be done using dplyr. The dplyr package is a very powerful R package designed for data manipulation. In this step, we first use the group_by function to segment movies by their unique IMDb IDs, then use filter function to keep only the first row in each group. Package dplyr has a function row_number that serves as a hidden “column” for row indices, which is exactly what we need in this operation. The R command for this step looks like this:

df = df %>% group_by(imdbID) %>% filter(row_number() == 1)

Adding Data from API

This part is fairly straight-forward with the requests module. We only need to supply an API token and a valid IMDb ID to the url format to get the desired movie scores. The imdbID column in table movies is of numeric type, whereas a standard movie title ID takes the form “tt + a seven-digit integer”. Therefore in this step I used the format function to pad the imdbID strings with leading zeros to make them have a fixed width of 7. By calling requests.get with each unique IMDb ID, I was able to add the Metascore and IMDb user score to each movie in the cleaned table. Note that of all four scoring systems we are interested in, IMDb user score is the only one on a scale of 0-10. For the scores to be comparable, we multiplied IMDb user scores by 10 to put them on the same 0-100 scale as the other three scoring systems.

Joining Tables in Spark

As mentioned in Data Sources section, all the supporting tables like movie_actors can be joined to the main table by movieID column. Since we already removed duplicated movie records from the main table, there won’t be any redundant or misleading rows as long as we use INNER JOIN in the SQL commands. The main table is used in all analyses in this project, while other tables are joined when the need arises in each specific task. Because most of the analyses relies heavily on averaging and/or ranking review scores, we will leave the specific SQL operations to the next section.

Analysis and Visualization