Thanks to the Internet, access to data has become commonplace. But this applies mainly to data meant to be read by humans. Machine-readable data is rarer and often not openly accessible. It is possible to parse data out of HTML pages, but this is cumbersome and slow. In many cases it is also expressly prohibited by the site's terms of use.
Fortunately, there are exceptions:
OMDbAPI provides access to a machine-readable film and media database.
OMDbAPI uses JSON for data exchange.
Here is an example of how to access this data with GNU R.
This tutorial will help you write some R code to access the OMDbAPI database. The first thing you might want to implement is a title search function.
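library("jsonlite")
# Search the database by title and return the list of hits.
searchMovie <- function(exp, url_mirror="http://www.omdbapi.com") {
  # Encode the search term so that spaces and special characters are safe in a URL.
  jsonURL <- url(paste0(url_mirror, "/?s=", URLencode(exp, reserved=TRUE)))
  movie <- fromJSON(readLines(jsonURL, warn=FALSE), simplifyVector=FALSE)
  close(jsonURL)
  # The hits are stored in the "Search" element of the response.
  return(movie$Search)
}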
The expression /?s=peace is appended to the URL, which gives http://www.omdbapi.com/?s=peace.
The server returns the result as a JSON object, so a JSON parser is needed to read it. This example uses the »jsonlite« package.
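To see what the parser does without touching the network, the following minimal sketch feeds jsonlite a hand-written JSON string shaped like a search response. The sample string and the helper name buildSearchURL are assumptions made for illustration, not output captured from the service. The sketch also shows how a multi-word search term could be URL-encoded before it is appended.

```r
library("jsonlite")

# Hypothetical helper: build the search URL, encoding the term so that
# spaces and reserved characters are safe inside a URL.
buildSearchURL <- function(exp, url_mirror = "http://www.omdbapi.com") {
  paste0(url_mirror, "/?s=", URLencode(exp, reserved = TRUE))
}

# Hand-written sample imitating the shape of a search response (assumed).
sampleJSON <- '{"Search":[
  {"Title":"War and Peace","Year":"1956","imdbID":"tt0049934","Type":"movie"},
  {"Title":"War and Peace","Year":"1966","imdbID":"tt0063794","Type":"movie"}
]}'

# simplifyVector = FALSE keeps the result as nested lists.
parsed <- fromJSON(sampleJSON, simplifyVector = FALSE)
hits <- parsed$Search

searchURL <- buildSearchURL("war and peace")
```

Each element of hits is itself a list, so a single field is reached with, for example, hits[[2]]$imdbID.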
# one word search
searchResult <- searchMovie("peace")
| | Title | Year | imdbID | Type |
|---|---|---|---|---|
| 1 | Superman IV: The Quest for Peace | 1987 | tt0094074 | movie |
| 2 | War and Peace | 1956 | tt0049934 | movie |
| 3 | War and Peace | 1966 | tt0063794 | movie |
| 4 | Peace, Love, & Misunderstanding | 2011 | tt1649780 | movie |
| 5 | War and Peace | 2007 | tt0495055 | movie |
| 6 | Peace on Earth | 1939 | tt0031790 | movie |
| 7 | Rest in Peace, Mrs. Columbo | 1990 | tt0097088 | episode |
| 8 | Metal Gear Solid: Peace Walker | 2010 | tt1531061 | game |
| 9 | A Separate Peace | 2004 | tt0328400 | movie |
| 10 | Peace, Propaganda & the Promised Land | 2004 | tt0428959 | movie |
The function searchMovie returns a list of all hits. Each hit contains information such as the title, year, imdbID and type.
The imdbID is a unique identifier that is needed to get detailed movie information.
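Since the imdbID is what later requests need, it helps to pull the IDs out of the list of hits. The following is a minimal sketch, assuming the hits are nested lists with the fields shown in the table above. The hits object here is a hand-written stand-in, not a live result.

```r
# Hand-written stand-in for a list of hits (values taken from the table above).
hits <- list(
  list(Title = "War and Peace", Year = "1956",
       imdbID = "tt0049934", Type = "movie"),
  list(Title = "Rest in Peace, Mrs. Columbo", Year = "1990",
       imdbID = "tt0097088", Type = "episode"),
  list(Title = "War and Peace", Year = "1966",
       imdbID = "tt0063794", Type = "movie")
)

# vapply checks that every hit yields exactly one character value.
types <- vapply(hits, function(hit) hit$Type, character(1))
ids   <- vapply(hits, function(hit) hit$imdbID, character(1))

# Keep only the IDs of entries whose type is "movie".
movieIds <- ids[types == "movie"]
```

The resulting character vector can be handed on to any function that expects one or more imdbIDs.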
The next function, movieDetail, fetches detailed information such as the rating or the runtime in minutes.
movieDetail <- function(imdbId, url_mirror="http://www.omdbapi.com") {
  # Fetch the detail record for a single imdbID.
  getMovieDetail <- function(ID) {
    jsonURL <- url(paste0(url_mirror, "/?i=", ID))
    details <- fromJSON(readLines(jsonURL, warn=FALSE))
    close(jsonURL)
    return(details)
  }
  # Apply the lookup to every ID; the result is a list with one entry per movie.
  return(lapply(imdbId, getMovieDetail))
}
This time the expression /?i=tt.... is appended to the URL.
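Because paste0 is vectorised, the same expression works for several IDs at once. The following minimal sketch only builds the URLs and makes no request. The helper name buildDetailURLs is hypothetical, and the optional apikey query parameter is an assumption about the service that is not part of the code in this tutorial.

```r
# Hypothetical helper: build one detail URL per imdbID.
# paste0 is vectorised, so a character vector of IDs yields a vector of URLs.
buildDetailURLs <- function(imdbId,
                            url_mirror = "http://www.omdbapi.com",
                            apikey = NULL) {
  urls <- paste0(url_mirror, "/?i=", imdbId)
  # Assumption: if the service asks for a key, it is sent as "apikey".
  if (!is.null(apikey)) {
    urls <- paste0(urls, "&apikey=", apikey)
  }
  urls
}

ids <- c("tt0049934", "tt0063794")
detailURLs <- buildDetailURLs(ids)
```

Keeping URL construction in its own function makes it easy to inspect what would be requested before any network call is made.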
To get detailed information about a movie, its imdbID (identifier) is needed. This example uses imdbIDs taken from searchResult, but any imdbID can be used. The function could, for instance, fetch the 250 top-rated movies simply by passing it a vector of the corresponding imdbIDs.
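# Using the function: searchResult is a list of hits, so extract the
# imdbIDs of hits 2 and 3 first.
ids <- sapply(searchResult[2:3], function(hit) hit$imdbID)
movies <- movieDetail(ids)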
| | Title | Year | imdbID | imdbRating | imdbVotes | Runtime (min) |
|---|---|---|---|---|---|---|
| 1 | War and Peace | 1956 | tt0049934 | 6.80 | 5314 | 208 |
| 2 | War and Peace | 1966 | tt0063794 | 7.70 | 3934 | 427 |
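The function movieDetail returns a list with one entry per movie, containing details such as the title, year, imdbID, rating, number of votes and runtime.
Not all attributes are listed in this example. To get a table like the one above, the list has to be coerced to a data frame.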
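The coercion from a list to a data frame could look roughly like the following minimal sketch. The movies object is a hand-written stand-in with values from the table above, and the string formats ("208 min", "5,314") are assumptions about how the service delivers these fields. The helper name detailsToDataFrame is hypothetical.

```r
# Hand-written stand-in for a list of movie details (string formats assumed).
movies <- list(
  list(Title = "War and Peace", Year = "1956", imdbID = "tt0049934",
       imdbRating = "6.8", imdbVotes = "5,314", Runtime = "208 min"),
  list(Title = "War and Peace", Year = "1966", imdbID = "tt0063794",
       imdbRating = "7.7", imdbVotes = "3,934", Runtime = "427 min")
)

# Hypothetical helper: keep selected fields and bind one row per movie.
detailsToDataFrame <- function(movies) {
  fields <- c("Title", "Year", "imdbID", "imdbRating", "imdbVotes", "Runtime")
  rows <- lapply(movies, function(m) {
    as.data.frame(m[fields], stringsAsFactors = FALSE)
  })
  df <- do.call(rbind, rows)
  # Convert the text fields to numbers.
  df$imdbRating <- as.numeric(df$imdbRating)
  df$imdbVotes  <- as.numeric(gsub(",", "", df$imdbVotes))
  df$Runtime    <- as.numeric(sub(" min$", "", df$Runtime))
  df
}

movieTable <- detailsToDataFrame(movies)
```

The sketch expects every selected field to be present in each entry. A record that lacks one of them would need extra handling before the rows can be bound together.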