Movie Database API Query
Introduction
The Open Movie Database (OMDb) is a public online database and provides API to use the web service for movie information.
In this project, I will go through my setup including a script to render the rmarkdown, and the tools and libraries I use in this project. Then I will develop a set of helper functions including a wrapper function to make the API calls easy with returned data in a data frame format. After checking all the functions work as expected, I will then use it to obtain some movie information from OMDb web service and perform some basic EDA including numerical and graphical summaries on the data.
Setup 0: Codes to Render this Markdown Page
This code chunk (don’t execute) shows how to use
RMarkdown::render() to render this rmarkdown document.
Setup 1: Confidential Keys
It’s a good practice not to include any login confidential
information like passwords or access keys explicitly in codes. Here, we
use keyring to store the API key.
At the first time to run this notebook, uncomment the following line and execute it. RStudio will ask for your API key and store it. This needs to be done only once.
Setup 2: (Install and) Load Packages
We will use the following packages in this project:
here: enables easy file referencing and builds file paths in a OS-independent waykeyring: provides OS-independent ‘API’ to access system’s credential storestats: loads this before loadingtidyverseto avoid masking sometidyversefunctionstidyverse: includes collections of useful packages likedplyr(data manipulation),tidyr(tidying data),ggplots(creating graphs), etc.rlang: deals with missing arguments in our helper functionslubridate: handle date and datetime data typeglue: offers interpreted string literals for easy creation of dynamic messages and labelsscales: formats and labels scales nicely for better visualizationhttr: performs HTTP web APIsjsonliet: parses data stored in JSON formatGGally: plots a very nice scatterplot matrix
In addition, the pacman package provides handy tools to
manage R packages (install, update, load and unload). We use its
p_laod() instead of libarary() to load the
packages listed above.
# load the `pacman` package
# if it's not yet installed, install and load it
if (!require("pacman")) utils::install.packages("pacman", dependencies = TRUE)## Loading required package: pacman
Helper Functions
The first helper function is a negated version of the built-in
%in% operator. This makes our codes cleaner and easier to
read when we check if an element is absent.
To make the OMDb API easy to use, we define some helper functions and wrap them up in a user-friendly function to perform a query. There are mainly two different API queries: (1) inquiry by a IMDb ID or a movie title, and (2) inquiry with a movie title search. To perform the queries, we define the following helper functions and an easy-to-use wrapper to do API calls.
OMDb_setup_query(): sets up the query parametersOMDB_query(): performs an API call and check the results (status)OMDb_parse_movie(): parse inquiry results of a movie by its IMDb ID or title (with/without other optional parameters)OMDb_parse_search()parse search results (with/without using other optional parameters)OMDb_movie(): a wrapper function to do a movie query or search (with or without other optional parameters)
Note: The variable Year returned in the query response
is character but might not be a single year. In some cases (especially
for media type: series), it can be a character string like
2015-2020 or 2015-. Therefore,
Year variable will be kept as is. Instead, a new date
variable Year_satrt will be created to get the beginning
year of the span.
OMDb_setup_query()
Create query parameters in a list of key-value pairs:
Arguments:
args: a list of all other arguments passed in from the wrapper functionOMDb_movie()by: method of API query. Needs to be one of the API supported methodsvalue: value to inquery or search forpage: (optional) specified page of search resultsverbose: a boolean switch to print out the queryReturned Value: query parameters in a list of key-value pairs
# a helper function to create query parameters in a list of key-value pairs
OMDb_setup_query <- function(args, by = c("id", "title", "search"), value, page, verbose = T, ...) {
# verify query method argument `by` to be one of the supported methods
by <- match.arg(by)
# assign default query value if it's missing in argument (for demo and test purpose)
if(is_missing(value)){
value <- c("tt1285016", "Batman", "Batman")[by == c("id", "title", "search")]
}
# assign default page value (only needed when using search method)
if(by == "search") {
args$page <- ifelse(is_missing(page), "1", as.character(page))
}
# remove `by` and `value` from the list
# also remove argument `r` (if exists) because this function only supports JSON
args[c("by", "value", "r")] <- NULL
# construct the query as a list of query parameters and the api key
query0 <-
# key-value pair from the query method `by` and query value `value`
switch(by,
id = list(i = value),
title = list(t = value),
search = list(s = value)) %>%
# append the rest of query parameters
append(args)
# print out the query
if(verbose){
print(glue(
"Query: {paste(names(query0), query0, sep = '=', collapse = ',')}"
))
}
# add the api key
query <- query0 %>%
# add the api key
append(list(apiKey = key_get("RStudio Keyring Secrets", "OMDb")))
return(query)
}OMDB_query()
A wrapper of httr::Get() to send a query request,
convert and parse the response data, and check the execution status.
Arguments:
query: query parameters in a list of key-value pairsReturned Value: parsed JSON content data from the API response
# a low-level helper function to send a query request and check the execution status
OMDB_query <- function(query){
# send API query request
rawdata <- GET(url = "http://www.omdbapi.com/", query = query)
# check if the request is successful
# if not, stop and print the HTTP status code
if(!between(rawdata$status_code, 200, 299)) {
stop(glue(
"The HTTP GET request is not successful. HTTP status code: {rawdata$status_code}"
))
}
# parse the JSON content data
parsed <- fromJSON(rawToChar(rawdata$content))
# remove any empty list
for(i in names(parsed)){
if(length(parsed[[i]]) == 0) parsed[[i]] <- NA
}
# failed response
if(parsed$Response == "False"){
stop(glue(
"The HTTR Get request is successful but the respose of the query has a false return ",
"with an error message from API server: {parsed$Error}"
))
}
return(parsed)
}OMDb_parse_movie()
Parse info of a movie and save in a data frame after data type
conversion for numeric data (Runtim in minutes,
imdbRating, BoxOffice in dollar amount, etc)
and date time data (Released date).
Arguments:
parsed: a parsed JSON content data from a query of a movieReturned Value: a data frame with the information of the movie
# a helper function to parse info of a movie and save in a data frame
OMDb_parse_movie <- function(parsed){
# convert the data into appropriate data type and save in a data frame
df <- parsed %>%
as_tibble() %>%
# convert columns to numeric type
mutate(across(matches(
c("Runtime", "Metascore", "imdbRating", "imdbVotes", "BoxOffice", "totalSeasons")),
~parse_number(.x, na = c("", "NA", "N/A")))) %>%
# convert columns to datetime type
mutate(across(matches(
c("Released", "DVD")),
~parse_date(.x, na = c("", "NA", "N/A"), format = "%d %b %Y"))) %>%
mutate(Year_start = parse_number(Year))
return(df)
}OMDb_parse_search()
Parse search results and save in a data frame. If only asks for a specified page of the search results, return that page, otherwise, retrieve the remaining pages and return all results.
Arguments:
parsed: a parsed JSON content data from a query of a moviequery: query parameters in a list of key-value pairsargs: a list of all other arguments passed in from the wrapper functionOMDb_movie()Returned Value: a data frame with the search results
# a helper function to parse search results and save in a data frame
OMDb_parse_search <- function(parsed, query, args){
df <- parsed$Search %>% as_tibble()
# how many pages of the search results
totalPages <- ceiling(as.numeric(parsed$totalResults)/10)
# if there are more than one page and page argument is not specified
# get the remaining pages
if (("page" %notin% names(args)) && totalPages > 1){
for (i in 2:totalPages) {
query$page <- as.character(i)
df_nextpage <- OMDB_query(query)$Search %>% as_tibble()
df <- bind_rows(df, df_nextpage)
}
}
# add the total results and sort the data by Type and Year
df <- df %>%
mutate(
Year_start = parse_number(Year),
totalResults = parsed$totalResults
) %>%
arrange(Title, Type, Year)
}OMDb_movie()
A wrapper function to do a movie query or search (with or without other optional parameters).
Arguments:
by: method of API query. Needs to be one of the API supported methods...: (optional) query parameters includingtype,y(year),page(page number of search return),plot, etc. More info of the API parameters can be found here.Returned Value: a data frame of what the query returns
# a user-friendly wrapper function to perform an API query
OMDb_movie <- function(by = c("id", "title", "search"), ...) {
# save all arguments as a named list
args <- as.list(match.call())[-1]
# set up query string
query <- OMDb_setup_query(args, by, ...)
parsed <- OMDB_query(query)
# convert parsed data into a data frame based on query method
if (by == "search"){
# search returns a list of movies
df <- OMDb_parse_search(parsed, query, args)
} else {
# when using search by id and title, returns info of the movie
df <- OMDb_parse_movie(parsed)
}
return(df)
}Test and Examples of Using OMDb_movie()
In this section, we will test the wrapper function and demonstrate its usages.
Single-movie queries (by IMDb ID)
When doing a single-movie query with an IMDb ID (using OMDb API’s
i parameter), the API only returns an exact match if there
exists, or an false response with an error message if not exact match.
The successful returned data can be a movie, a TV series or a game
title. Different media types have different variables (data columns) in
the returns and the wrapper function should be able to take care of
these differences in the background.
## Query: i=tt0078346
## Query: i=tt0106057
## Query: i=tt0772202
Note: If a query returns a movie, the variable ratings
is a date frame with two columns: Ratings$Source and
RatingsValue. In a tibble, this is a list column (a column
that stores a data frame in it). The data frame printout might look
differently depending on the rendering options.
Single-movie queries (by Title)
When doing a query by a title (using OMDb API’s t
parameter), the API call again returns one result but its behavior is
different from queries by IMDb ID in the following ways:
- If there is no exact match, it returns one movie with partially matched title (not clearly how it decides which to pick).
- If there is only one exact match, it turns that movie.
- If there are more than one exact match, it still returns one movie
(again, not clearly how it decides which to pick). We can specify an
additional search parameter like
y(year) to get other matched movies.
## Query: t=Batman
## Query: t=Batman,y=1966
We can also specify the parameter plot to get different
versions of the plot:
## Query: t=Batman,y=1966,type=series
batman4 <- OMDb_movie(by = "title", value = "Batman", y = 1966, type = "series", plot = "full") # series## Query: t=Batman,y=1966,type=series,plot=full
## Short plot:
## The Caped Crusader and his young ward battle evildoers in Gotham City.
##
## Full plot:
## Wealthy entrepreneur Bruce Wayne (Adam West) and his ward Dick Grayson (Burt Ward) lead a double life: they are actually the crime-fighting duo Batman and Robin. A secret Batpole in the Wayne mansion leads to the Batcave, where Police Commissioner Gordon (Neil Hamilton) summons the Dynamic Duo on the Batphone with the latest emergency threatening Gotham City. Racing to the scene of the crime in the jet-powered Batmobile, Batman and Robin must (with the help of their trusty utility-belts) thwart the efforts of a rogues gallery of flamboyant archvillains, including The Joker (Cesar Romero), The Penguin (Burgess Meredith), The Riddler (Frank Gorshin) and the Catwoman (Julie Newmar and Eartha Kitt).
Search with a Partially Matched Value
Searches using OMDb API’s s parameter return a list of
media (movies/series/games) as long as they have exact or partial match
to the search value. Without specifying page parameter, our
helper function will make multiple API calls to retrieve all of the
results page by page and return all results in a data frame.
## Query: s=Batman,page=1
If the page parameter is specified, then it will only
return the results on the page (10 results on a page):
## Query: s=Batman,page=1
## Query: s=Batman,page=2
Tests with Bad Parameters
It’s always a good idea to test how our functions handle some bad inputs (query parameters) by users. In this section, we briefly test the following three cases.
Bad (Invalid) IMDBb ID
If we have an invalid IMDb ID, we get an error message from the respond.
Note: I have to turn the evaluation option off or the rendering function won’t work.
Bad (Makeup) Title
It seems this API always tries it best and returns a “partially” matching result even when you use a very weird title to search for:
## Query: t=@!#2@#
Bad Search Value
Because there is no exact match and are too many partial matched results, the API call return a failure response (although the HTTR request is successful) with an error message (from API server) saying that there are too many results.
Note: I have to turn the evaluation option off or the rendering function won’t work.
EDA
Since we have already done a search to get a list of all Batman-related title, let’s take a look at the data.
print(glue(
"The Batman search returns {nrow(batman_list)} results ",
"in {length(unique(batman_list$Type))} types of media ",
"including {batman_list$Type %>% unique() %>% toString()} ",
"from {min(batman_list$Year_start)} to {max(batman_list$Year_start)} ",
"(planned to be released)."
))## The Batman search returns 552 results in 3 types of media including movie, series, game from 1943 to 2025 (planned to be released).
##
## game movie series
## 46 451 55
# bar plot of media types
batman_list %>%
ggplot(aes(x = Type, fill = Type)) +
geom_bar() +
labs(
title = "Comparison of Number of Batman Movies/Series/Game Titles",
subtitle = glue("From {min(batman_list$Year_start)} to {max(batman_list$Year_start)}"),
x = "Media Type",
y = "Count"
)# histogram
batman_list %>% ggplot(aes(Year_start, fill = Type)) +
geom_histogram(position = "dodge2", binwidth = 1) +
labs(
title = "Histogram of Batman Movies/Series/Game Titles by Year",
subtitle = glue("From {min(batman_list$Year_start)} to {max(batman_list$Year_start)}"),
x = "Released Year",
y = "Number of Released Titles"
)I think it will be interesting to look at the movies of DC’s another big IP: Superman.
## Query: s=Superman,page=1
print(glue(
"The SuperBatman search returns {nrow(superman_list)} results ",
"in {length(unique(superman_list$Type))} types of media ",
"including {superman_list$Type %>% unique() %>% toString()} ",
"from {min(superman_list$Year_start)} to {max(superman_list$Year_start)} ",
"(planned to be released)."
))## The SuperBatman search returns 316 results in 3 types of media including movie, series, game from 1919 to 2025 (planned to be released).
##
## game movie series
## 11 282 23
We join these two data sets and compare them in a two-way contingency table:
DC_heroes <- bind_rows(
batman_list %>% mutate(Hero = "Batman"),
superman_list %>% mutate(Hero = "Superman")
)
DC_heroes##
## Batman Superman
## game 46 11
## movie 451 282
## series 55 23
# faceted scatter plots of counts by year
DC_heroes %>% ggplot(aes(Year_start, color = Hero)) +
geom_freqpoly(binwidth = 1) +
facet_grid(rows = vars(Type), scales = "free") +
labs(
title = "Comparison of Number of Movies/Series/Game Titles by Year",
subtitle = "Batman vs. Superman",
x = "Released Year",
y = "Number of Released Titles"
)From the faceted plot above, we can see that Batman have more titles produced in all three media types. Among the them TV series show strong periodic patterns than the other two media types.
Next, we will only focus on movies and only those with box office
records. For each movie, we will call the API wrapper function
OMDb_movie() with its IMDb ID to get its information
including box office records. To reduce the number of API calls, those
movies in DC_heroes without poster urls are unlikely to be
played in movie theaters and thus will be excluded in our next API
queries.
Note: I have to reduce the number of API calls because it’s so easy to reach the API query daily limit.
In case we need to do a reduced size of study, we canI limit the
number of queries by sampling some in the DC_heroes using
slice_sample(prop = 0.3) and in hope that the returned data
will have enough Batman and Superman movies with box office records for
the later EDA.
To get full search for EDA, set slice_sample(prop = 1)
or uncomment the line in the following code chunk.
# use API to get movie data for each movies
df_DC_raw <- DC_heroes %>%
# only movies with posters
filter(Type == "movie", Poster != "N/A") %>%
#slice_sample(prop = 0.3) %>%
select(Hero, imdbID) %>%
# convert to a named vector of IMDb ID
(function(x)(set_names(x$imdbID, x$Hero)))() %>%
# map through the IMDb ID to call API for queries of each movie
map_dfr(~OMDb_movie(by = "id", value = ., verbose = F), .id = "Hero")# side-by-side box plots of run time
df_DC_raw %>%
ggplot(aes(Hero, Runtime)) +
geom_boxplot(na.rm = TRUE) +
geom_jitter(aes(color = Rated), width = 0.1, na.rm = TRUE) +
labs(
title = "Runtime of All Batman and Super Movies",
subtitle = "First Look: Raw data",
x = "Superhero",
y = "Runtime [min]"
)There are 421 data rows from the query responses. Roughly speaking, Superman movies have longer run time. It is noticed that many of them are very short and are not rated by Motion Picture Association (MPA). These short and unrated movies are not likely played in movie theater and have box office records.
In order to make the following analysis more reasonable, I will only
consider those with MPA ratings and also remove the repeating data rows
due to flattening of the list column Ratings. Two new
variables are also created:
BO_per_min: Box office dollars per run time in minutespopularity: Popularity scaled defined as a product of IMDb Rating score and number of votes
df_DC <- df_DC_raw %>%
# remove repeat rows due to the flattening of list column
select(-Ratings) %>%
# filter out non-MPA rated films
filter(Rated %in% c("G", "PG", "PG-13", "R")) %>%
mutate(
Rated = factor(Rated, levels = c("R", "PG-13", "PG", "G")),
BO_per_min = BoxOffice/Runtime,
popularity = imdbRating * imdbVotes
)
# scatter plot of run time by year
df_DC %>%
ggplot(aes(Released, Runtime, color = Hero)) +
geom_point(na.rm = TRUE) +
facet_wrap(vars(Rated)) +
labs(
title = "Runtime of Batman and Super Movies by Year",
subtitle = "Films with MPA Ratings",
x = "Year",
y = "Runtime [min]"
)##
## R PG-13 PG G Sum
## Batman 12 54 16 1 83
## Superman 2 24 22 1 49
## Sum 14 78 38 2 132
The data set now has 132 movies with 83 Batman movies and 49 movies. Most of them are rated PG and PG-13. Batman movies have more rated R films probably because its story is darker and more violent. From the graph, it clearly shows that Superman movies on average have longer run time than the Batman movies. In the past decades, there is a trend that the movie run time are getting longer in R and PG-13 rated films.
Now let’s take at look at the box office:
# scatter plot of run time by year
df_DC %>%
ggplot(aes(Runtime, BoxOffice, color = Hero)) +
geom_point(na.rm = TRUE) +
geom_smooth(formula = y ~ x, method = "lm", se = F, na.rm = T) +
labs(
title = "Box Office of Batman and Super Movies by Run Time",
subtitle = "Films with MPA Ratings",
x = "Runtime [min]",
y = "Box Office [$]"
) +
scale_y_continuous(labels = label_dollar(suffix = " M", scale = 10^-6))Long movies tend to have bigger box office amount! There is a strong
correlation between BoxOffice and Runtime.
They have a large positive correlation coefficient 0.8310683. Below is a
side-by-side numeric summaries of the box office amount per run time (in
million dollars per minute):
# Numeric summaries on box office amount
df_DC %>%
mutate(
BoxOffice = BoxOffice / 10^6,
BO_per_min = BO_per_min / 10^6
) %>%
drop_na(BoxOffice, BO_per_min) %>%
group_by(Hero) %>%
summarise(
P75 = quantile(BO_per_min, probs = 0.75),
Mean = mean(BO_per_min),
P25 = quantile(BO_per_min, probs = 0.25),
Std = sd(BO_per_min)
)On average, every minute of the Batman movie brings in more box office amount than the Superman movies.
In addition to Runtime, it’s worthy to take a look at
how other variables like imdbRating and
imdbVotes correlate with the box office. A scatterplot
matrix comes handy to visualize their relationship.
# scatterplot matrix with correlation coefficients
df_DC %>%
select(Hero, BoxOffice, Runtime, imdbRating, imdbVotes) %>%
drop_na() %>%
ggpairs(
aes(color = Hero, alpha = 0.2),
axisLabels = "none"
) +
labs(
title = "Scatterplot Matrix with Correlation Coefficients"
)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The number of votes on IMDb also has a strong correlation with the box office.
At last, we will plot a scatter plot to show the popularity score we defined earlier is a strong indicator on the box office.
# scatter plot to show the correlation between box office and the popularity score
df_DC %>%
ggplot(aes(popularity, BoxOffice, color = Hero)) +
geom_point(na.rm = TRUE) +
labs(
title = "Box Office of Batman and Superman Movies vs. Popularity Score",
subtitle = "Films with MPA Ratings",
x = "Popularity Score [Voting number times ratings]",
y = "Box Office [$]"
) +
scale_x_continuous(labels = number_format(scale = 10^-6)) +
scale_y_continuous(labels = label_dollar(suffix = " M", scale = 10^-6))