Motivation
While it is fun to just go to a movie theatre and hope to be pleasantly surprised by a new release, many people prefer to go to movie review sites for recommendations. IMDb and Rotten Tomatoes are arguably two of the most popular websites where people can find reviews and ratings for nearly any movie. Both of these websites provide two sets of ratings: critics scores and audience scores. IMDb uses Metascore to represent critics opinions while Rotten Tomatoes uses Tomatometer to achieve the same goal. For a single movie, one can get four different scores just within these two websites. Therefore it can be confusing to some viewers as to which one to trust, or how to interpret the potentially large differences in scores.
This project aims to obtain insights regarding the interpretation and comparison among these different scoring systems through data analysis and visualization. One focus in this project would be to visualize the distributions of scores and see how they vary across the years and across different genres. The former can give us a temporal view of the movie qualities according to critics and viewers, while the latter will shed some light on whether or not particular genres tend to be favored by critics and/or viewers. This can give viewers a better understanding of where a score stands comparing to other movies in the same time or same genre, which is more informative than just looking at the sheer value of a score. Another objective of this project is to find out what factors into a movie getting vastly different scores from different sources. For example, Batman v Superman received a notoriously low Tomatometer score of 27%, which is almost 40 points than its audience score. It could be interesting to determine if this happens more often to movies with certain themes, or maybe even to certain directors. Finally, the project will also presents some rankings of actors and directors by averaging the scores of movies they appear in.
Data Sources
IMDb + Rotten Tomatoes Datasets from OCDF
The data can be found and downloaded from the Open Collaboration Data Factories website. There are 12 data tables in total, but not all of them will be used in this project. The datasets are a derivative of the MovieLens 10M dataset published by GroupLens research group, with some additional columns added through web scraping. Below is a more detailed description of the tables used in this project:
- movies: The main table, containing all ratings and review counts data from Rotten Tomatoes. There is an id column that links the main table to rest of the tables. Also there is an imdbID column that can be used to query IMDb ratings using API.
- movie_genres: This table links movie IDs to one or more genres from 20 different genres. Each movie averagely has two genres.
- movie_actors: Links movie IDs to actors appeared in those movies. Each actor is associated with a ranking variable, which is based on the order actors are listed on the movie’s IMDb page. Also each actor is associated with a unique id to avoid name confusions.
- movie_directors: Links movie IDs to directors. Like the actors table, each director has a unique id.
These tables cover information of 10197 movies from year 1903 to 2011 (Note: There are actually duplicated records in the original dataset. After removing them, the number of movies goes down to 9413.). However, years towards both ends of the time frame have much fewer movies than years in the middle. For the purpose of having a reasonable volume of movies for each year, we eliminate the years with fewer than 10 movies. As a result, the time frame is narrowed to from year 1930 to 2009, with 9357 movies in total.
OMDB API
The OMDB API is a RESTful web service to obtain movie information. It can be used to extract a wide range of variables just by providing a valid IMDb ID (for movie titles, it takes the format: tt + a seven-digit numeric id). The url request takes the form of “http://www.omdbapi.com/?apikey=xxxxxxxx&i=tt0031381”, where “tt0031381” is the IMDb ID of the movie Gone with the Wind. The requests return json strings that can be loaded into python and added as additional columns to the main table movies. In this case, most data we need are already covered by the OCDF dataset. We only need the OMDB API to obtain the Metascores and IMDb user scores. The API requests are made possible thanks to fact that the IMDb ID of each movie is readily available in the main table. By using OMDB API, we pulled the Metascore and IMDb user score for each of the 9357 movies and stored the data as new columns Metascore and imdbRating.
Data Manipulation Methods
In this section, we will go through the data manipulation I’ve done to prepare and process the data. Some of these steps are already mentioned in Data Sources section, and they will be explained in more details here.
General Workflow
As mentioned in the previous section, the original data is stored in DAT format. The read.delim
function in R can import DAT files into R dataframes with ease. Before the data is read into R, we first use Vim global command to delete some incorrectly written actor data to ensure there won’t be formatting errors when exporting the data to TSV or JSON files. After importing the downloaded data files in R, we cleared out duplicated movie records using dplyr
package. The resulting R dataframes were then written as TSV files and imported into python. In python, we added columns Metascore and imdbRating to the main table movies using pandas
module, requests
module and OMDB API. After the data cleaning, the data stored in pandas DataFrames
were exported as JSON files through the to_json
method provided by pandas
. The JSON files are formatted as one record per line so that pyspark can directly read them into DataFrames
. All data manipulation for further analysis are done in SparkSQL. Major steps in this process will be explained in more details in following sections.
Cleaning Data
The data downloaded from OCDF has already undergone some cleaning and reorganization so the data is almost ready to use. Though there are still a few things that need to be tidied up. The data cleaning relies on usage of both R and python. As mentioned in the previous section, the original data is stored in DAT format. The read.delim
function in R can import DAT files into R dataframes with ease.
Removing messy actor names
The data posted on OCDF has a few additional files than the original MovieLens data. One of them is movie_actors. The lists of actors in that table is rather complete. Some movies have over 50 actors associated to their IDs. Although some actors’ names are not stored in full in movie_actors table. All of these actors have names like John “Having a Long Nickname” Doe. Values of such names in the DAT file are incomplete strings and often end with ellipses that can easily hinder data importation. In order to eliminate these records and avoid getting errors when importing the DAT files, we deleted all lines containing these type of actor names using Vim command :g/\.\.\./d
. Only 389 out of 231742 lines were deleted. Since the deleted actors only take up a small portion of the entire table and they usually have very low rankings (much larger than 20), this removal is not expected to have much of an impact on our analysis concerning actors.
Removing duplicated movies
One issue is that there are occasions where different movie IDs point to the same movie (as mentioned briefly in last section). Fortunately we still have the IMDb ID that can act as a unique identifier of a movie. This cleaning task can be done using dplyr
. The dplyr
package is a very powerful R package designed for data manipulation. In this step, we first use the group_by
function to segment movies by their unique IMDb IDs, then use filter
function to keep only the first row in each group. Package dplyr
has a function row_number
that serves as a hidden “column” for row indices, which is exactly what we need in this operation. The R command for this step looks like this:
df = df %>% group_by(imdbID) %>% filter(row_number() == 1)
Adding Data from API
This part is fairly straight-forward with the requests
module. We only need to supply an API token and a valid IMDb ID to the url format to get the desired movie scores. The imdbID column in table movies is of numeric type, whereas a standard movie title ID takes the form “tt + a seven-digit integer”. Therefore in this step I used the format
function to pad the imdbID strings with leading zeros to make them have a fixed width of 7. By calling requests.get
with each unique IMDb ID, I was able to add the Metascore and IMDb user score to each movie in the cleaned table. Note that of all four scoring systems we are interested in, IMDb user score is the only one on a scale of 0-10. For the scores to be comparable, we multiplied IMDb user scores by 10 to put them on the same 0-100 scale as the other three scoring systems.
Joining Tables in Spark
As mentioned in Data Sources section, all the supporting tables like movie_actors can be joined to the main table by movieID column. Since we already removed duplicated movie records from the main table, there won’t be any redundant or misleading rows as long as we use INNER JOIN
in the SQL commands. The main table is used in all analyses in this project, while other tables are joined when the need arises in each specific task. Because most of the analyses relies heavily on averaging and/or ranking review scores, we will leave the specific SQL operations to the next section.
Analysis and Visualization
Exploratory Analysis on Temporal Trends and Genres Comparison
The first analysis task is to get a first look at the distribution of these different scores. Histograms and density plots are great tools for visualizing distributions. But given that we have the data of around ten thousand movies across 20 different genres and a large time frame, it would be more meaningful to segment the data by year of release and by genre. The score trends can reveal the progression of movie qualities as perceived by different scoring systems, while the score differences across genres can indicate whether or not certain scoring platforms have preferences on particular type of movies.
Trends of average scores by year
This analysis can be done using the main table alone, as year is already a column in it. The JSON file of movies table is loaded in pyspark and registered as a table in SQLContext
. We then run a SQL query on that table to group movies by year and get the average scores from year 1930 to 2009 (recall that we discarded the years with too few movies). The SQL command for this analysis is as follows:
SELECT year AS Year, AVG(rtAllCriticsScore) AS rtAllCritScore, AVG(rtAudienceScore) AS rtAudScore,
AVG(imdbRating) AS imdbRating, AVG(Metascore) AS Metascore, COUNT(id) AS count
FROM movies
GROUP BY year
ORDER BY year
Once we get the output, we can use pyspark command df.toPandas().to_csv()
to convert the output table to pandas DataFrame
and then save to a TSV file. All visualizations in this project are generated using R package ggplot2
. Figure 1 is a plot showing the score trends from 1930 to 2009. It is worth mentioning that Metascore is usually much more selective than the other three, which means the number of movies having a Metascore tend to be smaller than the total number of movies in the dataset. This is especially true for older movies. And that is why the trend plot in Metascore panel is constantly “broken” before 1950.
There are a few interesting findings we can observe from Figure 1:
Both Metascore and Tomatometer display a very obvious declining trend. Granted that the people who put together the original dataset might have sampled more well-known movies for movies released in earlier days, but we do have a rich amount of data after year 1970 with at lease close to 100 movies per year, and the declining rate still seems to stay the same. Especially for the most recent three years 2007 to 2009, both plots indicate a sharp drop in the average scores. So at least according to movie critics, the average movie quality has been steadily declining over the last few decades.
On the other hand, the Rotten Tomatoes audience scores indicate an opposite view on the progression of movie qualities. The average audience scores fluctuate quite a lot prior to 1980s. After that, the scores have been steadily increasing until 2009. Note that Rotten Tomatoes was officially launched in 2000. So the majority of average users are younger people who might not be familiar with older classics that came out before 1970s. It is also not a secret that, unlike critics, general movie viewers are better at enjoying the entertainment of a movie than its artistic values. So this is likely the reason why the critics scores and audience scores on Rotten Tomatoes have completely different trends.
Another thing worth mentioning in Figure 1 is that the IMDb score trend is almost flat in the entire time frame, with a barely visible declining pattern. Since the IMDb user score of a movie is calculated simply by averaging all user ratings, it is usually the least variable one out of four scores. It is quite common for a movie’s IMDb user score to end up somewhere around 6 to 7.5.
Distributions across different genres
To prepare data for this analysis, we need to join the movie_genres table to movies table. The SQL command is as follows:
SELECT m.rtAllCriticsScore, m.rtAudienceScore, m.imdbRating, m.Metascore, g.genre AS Genre
FROM movies AS m JOIN genres AS g ON m.id = g.movieID
Again, we saved the table as TSV file and loaded it in R. There are 20 genres in the dataset. While most of the genres each cover a good portion of our movies pool, the genre “Short” has only been assigned to one movie. Ignoring that movie leaves us with 19 genres and over 9000 movies. Boxplot is a nice option to visualize the distributions of 19 genres all at once. Figure 2 is a multipanel plot showing the boxplot of each scoring system within its own panel.
The boxplots reveal some interesting insights regarding the differences among genres:
Horror is almost unanimously agreed by all four scoring systems as the worst genre. Action films are a close second.
All scoring systems placed Film-Noir as the best genre, only in the case of Metascore that Documentary seems to be closely tied with Film-Noir. Documentary has a high score distribution in Tomatometer and Rotten Tomatoes audience score as well.
Compared to other scores, Tomatometer appears to be particularly fond of Western genre. The quartiles of Western film scores are averagely high at best in the other three scoring systems, while according to Tomatometer, Western has the third highest median and the second highest third quartile of all 19 genres.
Besides those findings, Figure 2 also show us plenty of traits of each scoring systems:
The distributions of Tomatometer scores tend to cover almost the entire scale from 0 to 100 for all genres. All 19 genres have movies that reach 100 on Tomatometer. Film-Noir even has its third quartile reaching 100, which means 25% of movies in Film-Noir genres are given a perfect rating on Tomatometer. The Tomatometer scores also have the widest interquartile ranges (IQR). Many genres’ IQRs cover half of the entire scale. On one hand, this means Tomatometer make use of the entire scale to evaluate movie qualities. On the other hand, because Tomatometer score is essentially the percentage of critics that give the movie a positive review, the score can easily goes to either end of the scale. After all, it is much easier for critics to agree on a binary choice than to agree on a specific score from 0 to 100.
Similar to the flat pattern in its trend plot, the distribution of IMDb user score is almost invariable across different genres. IMDb user score also has the narrowest interquartile ranges of all four scores. It is also the only scoring system that does not have movies that reach 100. Personally I don’t think these are bad things. Like I mentioned earlier, IMDb user score is simply the average of all submitted user ratings. This makes IMDb score more neutral than the others. General audience can get a sense of what average users feel about a movie, instead of being scared off by a terrifyingly low score or overhyped by a score close to 100.
Ranking Director/Actor Duos
It usually takes the combined effort of great directors and great actors to make extraordinary movies. In this analysis, the objective is to get rankings for director/actor duos based on their average movie scores.
First we need to construct a base table by joining tables movies, actors and directors. Because we mainly care about duos of director and leading actors, we add a condition to only use the rows in table actors that have a top 10 ranking. Here is the SQL command for joining the tables:
SELECT *
FROM movies AS m JOIN actors AS ma ON m.id = ma.movieID
JOIN directors AS md ON m.id = md.movieID
WHERE ma.ranking <= 10
The output of the above SQL query is then registered as table duo_base in SQLContext. Here we use Metascore as an example of getting the rankings. First we divide the rows of duo_base into groups, each of which is associated with a unique director/actor pair. We can use columns directorID and actorID to determine unique duos. Due to the fact that some movies don’t have Metascores (such movies have “N/A” in their Metascore field), we also need a WHERE
clause to exclude these records when averaging scores. A duo have to work on at least three movies together to be considered on the lists. As a result, the SQL query for this task looks like this:
SELECT FIRST(directorName) AS directorName, FIRST(actorName) AS actorName,
AVG(Metascore) AS Metascore, COUNT(imdbID) AS count
FROM duo_base
WHERE Metascore != 'N/A'
GROUP BY directorID, actorID
HAVING count >= 3
ORDER BY Metascore DESC
In order to save some space, here we only display the top 20 duos based on Metascore and IMDb user score.
Director | Actor | Metascore | No. of Movies | |
---|---|---|---|---|
1 | Francis Ford Coppola | Robert Duvall | 91.67 | 3 |
2 | Peter Jackson | Elijah Wood | 91.33 | 3 |
3 | Peter Jackson | Viggo Mortensen | 91.33 | 3 |
4 | Peter Jackson | Ian McKellen | 91.33 | 3 |
5 | Peter Jackson | Sean Astin | 91.33 | 3 |
6 | Peter Jackson | Billy Boyd | 91.33 | 3 |
7 | Francis Ford Coppola | John Cazale | 90.33 | 3 |
8 | Jean-Luc Godard | Jean-Luc Godard | 87.33 | 3 |
9 | Jean-Luc Godard | Anna Karina | 85.33 | 3 |
10 | Martin Scorsese | Joe Pesci | 84.67 | 3 |
11 | Martin Scorsese | Robert De Niro | 84.00 | 5 |
12 | Kar Wai Wong | Maggie Cheung | 83.33 | 3 |
13 | Jean-Pierre Dardenne | Olivier Gourmet | 82.75 | 4 |
14 | Guillermo del Toro | Doug Jones | 82.67 | 3 |
15 | Kar Wai Wong | Tony Leung Chiu Wai | 81.75 | 4 |
16 | Francis Ford Coppola | Diane Keaton | 81.67 | 3 |
17 | Francis Ford Coppola | Al Pacino | 81.67 | 3 |
18 | Francis Ford Coppola | Talia Shire | 81.67 | 3 |
19 | Woody Allen | Diane Keaton | 81.60 | 5 |
20 | Steven Soderbergh | Luis Guzman | 81.33 | 3 |
Director | Actor | IMDb score | No. of Movies | |
---|---|---|---|---|
1 | Francis Ford Coppola | Robert Duvall | 89.00 | 3 |
2 | Peter Jackson | Elijah Wood | 88.00 | 3 |
3 | Peter Jackson | Viggo Mortensen | 88.00 | 3 |
4 | Peter Jackson | Ian McKellen | 88.00 | 3 |
5 | Peter Jackson | Sean Astin | 88.00 | 3 |
6 | Peter Jackson | Billy Boyd | 88.00 | 3 |
7 | Francis Ford Coppola | John Cazale | 87.00 | 3 |
8 | Christopher Nolan | Christian Bale | 86.00 | 3 |
9 | Francis Ford Coppola | Diane Keaton | 86.00 | 3 |
10 | Francis Ford Coppola | Al Pacino | 86.00 | 3 |
11 | Francis Ford Coppola | Talia Shire | 86.00 | 3 |
12 | Masaki Kobayashi | Michiyo Aratama | 85.00 | 4 |
13 | Sergio Leone | Clint Eastwood | 84.00 | 3 |
14 | Martin Scorsese | Joe Pesci | 83.67 | 3 |
15 | Masaki Kobayashi | Tatsuya Nakadai | 83.43 | 7 |
16 | Charles Chaplin | Charles Chaplin | 83.00 | 4 |
17 | Kenji Mizoguchi | Kikue Mori | 82.67 | 3 |
18 | Michael Apted | Symon Basterfield | 82.50 | 4 |
19 | Michael Apted | Bruce Balden | 82.50 | 4 |
20 | Michael Apted | Jacqueline Bassett | 82.50 | 4 |
We can see plenty of familiar names on both lists. Francis Ford Coppola and Robert Duvall appear at the top of both lists. Francis Ford Coppola even have five entries on both lists with the same five actors. This is largely due to the success of the Godfather franchise. The same goes for Peter Jackson. Because of the critically acclaimed Lord of the Rings trilogy, He has five entries on both lists as well and all of them are from Lord of the Rings trilogy. Besides these two directors, there are some other famous duos on the lists as well, for example, Martin Scorsese/Robert De Niro and Woody Allen/Diane Keaton. Another interesting duo is Charles Chaplin/Charles Chaplin. Even though this is not technically a duo but I decided to leave it in the list because it is quite a feat to have directed and acted in four movies averaging 83 on IMDb.
Finding Movies with Polarizing Scores
One of the first thing that piqued my interest in this project is that the movie Twin Peaks: Fire Walk with Me received a 72 on IMDb score but only got a 28 on Metascore. Even though the IMDb user score of a movie tends to be higher than its Metascore, a score difference of 44 is still astounding. Therefore in this section we will look into movies with a huge gap between their IMDb scores and Metascores. Note that we are only interested in movies with a much higher IMDb score, as a movie getting a much higher Metascore rarely happens.
Movies with the largest score differences
For the first step, we will simply compute the score difference and show movies with the 20 largest score differences. This only needs a straight-forward SQL query on the movies table and subtract column Metascore from column imdbRating.
Title | IMDb score | Metascore | Deference | |
---|---|---|---|---|
1 | Half Baked | 67 | 16 | 51 |
2 | Screwed | 57 | 7 | 50 |
3 | La tigre e la neve | 72 | 22 | 50 |
4 | Billy Madison | 64 | 16 | 48 |
5 | Highlander | 72 | 24 | 48 |
6 | I Am Sam | 76 | 28 | 48 |
7 | Kung Pow: Enter the Fist | 62 | 14 | 48 |
8 | Vulgar | 53 | 5 | 48 |
9 | Gummo | 67 | 19 | 48 |
10 | Tropa de Elite | 81 | 33 | 48 |
11 | The Butterfly Effect | 77 | 30 | 47 |
12 | Friday the 13th | 65 | 19 | 46 |
13 | 8MM | 65 | 19 | 46 |
14 | Cocktail | 58 | 12 | 46 |
15 | The Life of David Gale | 76 | 31 | 45 |
16 | Somewhere in Time | 73 | 29 | 44 |
17 | Twin Peaks: Fire Walk with Me | 72 | 28 | 44 |
18 | The Beastmaster | 62 | 18 | 44 |
19 | The Keep | 58 | 14 | 44 |
20 | Bio-Dome | 44 | 1 | 43 |
Many of the movies on Table 3 have an IMDb score over 70, which is usually a high enough score for a movie to be enjoyable. But they all get extremely low Metascores. Recall that Metascore is the weighted average of individual critic ratings. So a metascore as low as the ones in Table 2 means most critics find the movie horrible. In the next step we will dig a little deeper to find out what genres constantly receive polarizing scores.
Polarizing movies by genre
Assuming the table used by the last task is saved as gap
, then we can join tables gap
and genres
by movieID. Here we arbitrarily set the threshold of score difference at 30, as we are mostly interested to see what genres tend to have exceedingly large score gaps. Below is the SQL command to count the number of movies having a score gap larger than 30 in each genre:
SELECT genres.genre AS Genre, COUNT(gap.id) AS Count
FROM gap JOIN genres ON gap.id = genres.movieID
WHERE gap.gap >= 30
GROUP BY genre
ORDER BY Count DESC
Figure 3 is a bar chart that illustrates the results.
In Figure 3, the labels on the right hand side indicate how many movies in each genre have a score gap larger than 30. The labels on the left hand side are the percentages of movies that have a score gap larger than 30 in each genre. Movies without either an IMDb score or a Metascore are excluded from the denominators when computing these percentages.
From Figure 3, we can see that Comedy has a lot more polarizing movies than other genres. In fact, there are 183 movies that have a score difference larger than 30 (a movie can have multiple genres). This means sixty percent of those movies have comedy elements. This is not so far off from intuition. General audience usually judge a comedy movie by its ability to make people laugh, while critics consider far more standards than just decide whether or not the jokes are good.
Even though Drama is a close second to Comedy when looking at just the movie count, this doesn’t mean Drama films are prone to getting polarizing reviews as half the movies can be categorized as Drama. In order to investigate which genres are more likely to receive polarizing ratings, it is useful to look at the percentage of polarizing movies over the number of movies in each genre. According to the percentages, Crime, Comedy, Thriller, Action and Horror have a much higher chance than other genres to get polarizing scores. Similar to the discussion in section 4.1, these genres are usually entertaining for general audience to watch, but can have a harder time at getting praise from critics.
Challenges
The main challenge in this project was the data cleaning process. There were a few inconspicuous data entry errors that didn’t cause any script to break but silently altered the analysis results. For example, some actors’ names were written in very messy format and rendered multiple following rows unable to be read into data frames. Similarly, duplicated movie records initially caused many problems in ranking director/actor duos. Both of these issues were only discovered after I closely inspected the outputs and tried to get verification by manually searching movie information through both websites and API. Eventually they were handled using Vim and dplyr
package respectively, as already explained in Data Manipulation Methods section.
Another major issue with the data is that the accuracy of Rotten Tomatoes data is questionable. The OCDF dataset was compiled in 2011, so the Rotten Tomatoes data was probably scraped during that time (Rotten Tomatoes data was not included in the original MovieLens dataset). A closer inspection at the data reveals that many movies have a very limited number of Rotten Tomatoes critics reviews. Therefore some movies have drastically different Tomatometer scores than they do today. For example, in our dataset, Batman & Robin only has one positive critics review and therefore has a perfect Tomatometer score at 100. However, its score on today’s Rotten Tomatoes website is merely 10 and it’s considered one of the worst movies of all time. There is no way to verify whether the data is completely false or it is simply because there weren’t enough reviews back then. Unfortunately, I couldn’t find a publicly available API to query up-to-date Rotten Tomatoes data for each movie. So eventually the analyses were carried out using the potentially faulty Rotten Tomatoes data. Furthermore, the last two analyses were performed on IMDb scores and Metascores only due to those analyses being more sensitive towards extreme scores, which happens more often in Rotten Tomato scoring system than IMDb scoring system.