For this week’s assignment, we will be working primarily with APIs. As an avid fan of movies, I of course picked the Movie Reviews API.
As the APIs all appear to be accessed through a URI, this one should be interesting to work with. There are 4 basic URIs available to us, but the 2 that I will be using for this project are as follows:
Search By Keyword
http://api.nytimes.com/svc/movies/{version}/reviews/search[.response_format]?[optional-param1=value1]&[...]&api-key={your-API-key}
Review and NYT Critics’ Picks
http://api.nytimes.com/svc/movies/{version}/reviews/{resource-type}[.response_format]?[optional-param1=value1]&[...]&api-key={your-API-key}
As these URIs use a GET request, the primary tool that we can use for this project is the jsonlite package. Its fromJSON function can fetch a constructed URL directly and parse the JSON that comes back.
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
library(stringr)
library(tidyr)
From here, the easiest way to pull information is to “build” a URI using the above format. Here is a sample URI built with the search key “The Martian”:
url<- "http://api.nytimes.com/svc/movies/v2/reviews/search.json?query='The+Martian'&api-key=e2d581c1a5059550bc8711ca7e9bc86a:17:73348668"
url_data <- fromJSON(url)
data <- url_data$results
colnames(data)
## [1] "nyt_movie_id" "display_title" "sort_name"
## [4] "mpaa_rating" "critics_pick" "thousand_best"
## [7] "byline" "headline" "capsule_review"
## [10] "summary_short" "publication_date" "opening_date"
## [13] "dvd_release_date" "date_updated" "seo_name"
## [16] "link" "related_urls" "multimedia"
This is just a generic pull; it shows the data that comes back and the column headers that are created.
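The sample URI above was typed out by hand. As a minimal sketch, the same string can be assembled from its parts. The helper name build_search_url and the placeholder key "YOUR-API-KEY" are assumptions for illustration and are not part of the API or of jsonlite. Only base R is used, and nothing is sent over the network.

```r
# Hypothetical helper: assembles a keyword-search URI from its parts.
# The api_key default is a placeholder, not a working key.
build_search_url <- function(query, api_key = "YOUR-API-KEY",
                             version = "v2", response_format = "json") {
  # Join the words with "+", matching the format of the sample URI
  plus_query <- gsub(" ", "+", query, fixed = TRUE)
  paste0("http://api.nytimes.com/svc/movies/", version,
         "/reviews/search.", response_format,
         "?query='", plus_query, "'",
         "&api-key=", api_key)
}

build_search_url("The Martian")
```

The resulting string could then be handed to fromJSON in the same way as the hand-written URL.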
Though the assignment this week appeared to be rather open-ended (and truthfully I was a little perturbed by it), I decided to have a little fun: rather than pull a vast dataset, I created a simple, generic keyword search for this particular API. Furthermore, the data from this API is not very useful statistically. I don’t really read the NY Times, and I assumed the reviewers gave a “rating”, but it turns out they simply list a movie as a “critics pick” or not. I looked at the column data and found that the only fields I would be particularly interested in for a query are the movie title, the MPAA rating, critics pick, thousand best, opening date, and the DVD release date. So, I created a simple function:
Keyword_Search <- function(keyword){
  # Split the keyword on spaces and rejoin the words with "+", the format the URL requires
  new_key <- paste(unlist(strsplit(toString(keyword), " ")), collapse = "+")
  # URL split around the query; the '' are left in to make the query more restrictive
  URI_1 <- "http://api.nytimes.com/svc/movies/v2/reviews/search.json?query='"
  URI_key <- "'&api-key=e2d581c1a5059550bc8711ca7e9bc86a:17:73348668"
  # Combine the separate pieces into one URL and extract the JSON
  data <- fromJSON(paste0(URI_1, new_key, URI_key))
  data_frame <- data$results
  # Cleaning: keep only the six columns of interest (selected by name) and rename them
  keep <- c("display_title", "mpaa_rating", "critics_pick",
            "thousand_best", "opening_date", "dvd_release_date")
  data_frame <- data_frame[, keep]
  names(data_frame) <- c("Movie", "MPAA_Rating", "Critics_Pick",
                         "Thousand_Best", "Opening_Date", "Dvd_Release")
  data_frame
}
Keyword_Search("Apollo")
## Movie MPAA_Rating Critics_Pick Thousand_Best Opening_Date
## 1 Apollo 18 PG13 0 0 2011-09-02
## 2 House of Pleasures <NA> 0 0 2011-11-25
## 3 Apollo 13 PG 1 1 1995-06-30
## 4 Purple Rain R 0 0 1984-07-27
## 5 Broadway Danny Rose <NA> 1 0 1984-01-01
## Dvd_Release
## 1 <NA>
## 2 <NA>
## 3 2006-08-22
## 4 <NA>
## 5 2001-11-06
Keyword_Search("Saving Private Ryan")
## Movie MPAA_Rating Critics_Pick Thousand_Best
## 1 Saving Private Ryan R 1 1
## 2 Saving Private Perez PG13 0 0
## 3 Private School R 0 0
## 4 Woman in Gold PG13 0 0
## 5 Goosebumps PG 0 0
## 6 Mississippi Grind R 1 0
## 7 The Young Kieslowski R 0 0
## 8 Jack Ryan: Shadow Recruit PG13 0 0
## 9 Kirk Cameron's Saving Christmas PG 0 0
## 10 The Admiral: Roaring Currents <NA> 0 0
## 11 Cabin Fever: Patient Zero NR 0 0
## 12 Reasonable Doubt R 0 0
## 13 Jackie & Ryan PG13 0 0
## 14 Catch Hell 0 0
## 15 Breathe In R 0 0
## 16 Devil's Knot NR 0 0
## 17 Escape Plan R 0 0
## 18 R.I.P.D. PG13 0 0
## 19 Code Black NR 1 0
## 20 Good Ol' Freda PG 0 0
## Opening_Date Dvd_Release
## 1 1998-07-24 1999-11-02
## 2 2011-09-02 <NA>
## 3 1983-07-29 <NA>
## 4 2015-04-03 <NA>
## 5 2015-10-16 <NA>
## 6 2015-09-25 2015-08-18
## 7 2015-07-24 <NA>
## 8 2014-01-17 <NA>
## 9 2014-11-14 <NA>
## 10 2014-08-15 <NA>
## 11 2014-08-01 <NA>
## 12 2014-01-17 <NA>
## 13 2015-07-03 <NA>
## 14 2014-10-10 <NA>
## 15 2013-03-28 <NA>
## 16 2014-05-09 <NA>
## 17 2013-10-18 <NA>
## 18 2013-07-19 <NA>
## 19 2014-06-20 2015-02-24
## 20 2013-09-13 <NA>
The query behavior of the API is not the greatest. Apparently it uses an index together with an “or” match, so any search with multiple words returns every review that matches any one of those words. As it doesn’t search for exact matches, it tends to pull more results than needed. Fortunately, it does find the most likely matches and lists them first.
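One possible way to tighten the results is to filter them locally after the pull, keeping only titles that contain every search word. The sketch below is an assumption about how that could be done, not something the API offers. The helper name filter_all_words is hypothetical, and the sample data frame is built by hand from a few titles in the “Saving Private Ryan” output above rather than fetched.

```r
# Hypothetical helper: keep only rows whose title contains every search word.
# Matching is a case-insensitive substring test, so "ryan" would also match "Bryan".
filter_all_words <- function(results, keyword, title_col = "Movie") {
  words <- tolower(unlist(strsplit(keyword, " ", fixed = TRUE)))
  titles <- tolower(results[[title_col]])
  # Start with every row kept, then AND in one substring test per word
  keep <- Reduce(
    `&`,
    lapply(words, function(w) grepl(w, titles, fixed = TRUE)),
    rep(TRUE, length(titles))
  )
  results[keep, , drop = FALSE]
}

# A few titles copied from the "Saving Private Ryan" output above
sample_results <- data.frame(
  Movie = c("Saving Private Ryan", "Saving Private Perez",
            "Private School", "Jack Ryan: Shadow Recruit"),
  stringsAsFactors = FALSE
)

filtered <- filter_all_words(sample_results, "Saving Private Ryan")
filtered$Movie
```

In this small sample only the first title contains all three words, so the other rows are dropped.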
Anyway, that is a simple piece of code. I didn’t put any error checks or stops into the function (i.e. for when you pass in a non-string value). Those should all produce an automatic error from the R console anyway, so the checks seemed superfluous.
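For anyone who does want explicit checks, a minimal sketch follows. The helper name check_keyword is hypothetical and is not part of the function above. One reason it might be worth having: toString() converts a number to text, so a numeric keyword may be passed along to the API as a search term rather than rejected.

```r
# Hypothetical helper: stops early with a clear message on bad input.
check_keyword <- function(keyword) {
  # Must be exactly one non-missing character string
  if (!is.character(keyword) || length(keyword) != 1 || is.na(keyword)) {
    stop("keyword must be a single character string", call. = FALSE)
  }
  # Reject strings that are empty or only whitespace
  if (!nzchar(trimws(keyword))) {
    stop("keyword must not be empty", call. = FALSE)
  }
  invisible(keyword)
}

check_keyword("Apollo")                  # passes silently
try(check_keyword(123), silent = TRUE)   # would stop with an error
```

A call to check_keyword could be placed as the first line of a search function so that bad input fails before any request is made.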