Starting Up
I used the following packages for this assignment:
library(httr)
library(tidyr)
library(plyr)
I applied for and obtained a NY Times API key. The key was saved in the system environment using the Sys.setenv() function and can be retrieved with Sys.getenv("nytimes_apikey"). For this assignment, I am interested in the top_stories API. My goal is to collect the data from all of the top_stories sections through the API and assemble it into a data.frame object.
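For reference, storing and retrieving the key looks roughly like this (the key value shown is only a placeholder, not my real key):
Sys.setenv(nytimes_apikey = "your-api-key-here")  # placeholder value; run once per session
Sys.getenv("nytimes_apikey")                      # retrieves the stored key without hard-coding it in the script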
Wrapping Function for JSON List
Before grabbing data from the NY Times API, I first built a helper function to process the JSON responses. From the last assignment, I noticed that several common steps can be wrapped into one function.
The function below takes a JSON list object and performs the transformations needed to turn it into a data.frame object. This will come in handy whenever I need to process a JSON response.
readNYTimesApi <- function(jsonlist){
  jsonlist %>%
    lapply(unlist) %>%                           # Unlist each list element inside jsonlist into a named vector
    lapply(t) %>%                                # Transpose each vector into a one-row matrix
    lapply(data.frame, stringsAsFactors = F) %>% # Turn each one-row matrix into a data.frame
    do.call(rbind.fill, .)                       # Bind the data.frames together row-wise, filling missing columns with NA
}
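As a quick sanity check, here is the function applied to a small hand-made list (the element names below are made up purely for illustration):
toy <- list(list(title = "A", section = "world", tag = "x"),
            list(title = "B", section = "arts"))
readNYTimesApi(toy)  # should return a two-row data.frame; the "tag" column missing from the second element is filled with NA by rbind.fill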
Top Stories
The API documentation for the NY Times top_stories is here: http://developer.nytimes.com/top_stories_v2.json#/Documentation/GET/%7Bsection%7D.%7Bformat%7D.
The documentation explains that the url for the GET call should be in the format /{section}.{format}, where the “format” should be .json and the “section” can be any of the following: “home, opinion, world, national, politics, upshot, nyregion, business, technology, science, health, sports, arts, books, movies, theater, sundayreview, fashion, tmagazine, food, travel, magazine, realestate, automobiles, obituaries, insider”.
sections <- "home, opinion, world, national, politics, upshot, nyregion, business, technology, science, health, sports, arts, books, movies, theater, sundayreview, fashion, tmagazine, food, travel, magazine, realestate, automobiles, obituaries, insider" %>%
strsplit(split=", ") %>%
unlist()Below codes construct a vector containing the query strings for all of the sections.
topstories <- "https://api.nytimes.com/svc/topstories/v2/"
urls <- paste(topstories, sections, ".json?", sep = "")
urls
## [1] "https://api.nytimes.com/svc/topstories/v2/home.json?"
## [2] "https://api.nytimes.com/svc/topstories/v2/opinion.json?"
## [3] "https://api.nytimes.com/svc/topstories/v2/world.json?"
## [4] "https://api.nytimes.com/svc/topstories/v2/national.json?"
## [5] "https://api.nytimes.com/svc/topstories/v2/politics.json?"
## [6] "https://api.nytimes.com/svc/topstories/v2/upshot.json?"
## [7] "https://api.nytimes.com/svc/topstories/v2/nyregion.json?"
## [8] "https://api.nytimes.com/svc/topstories/v2/business.json?"
## [9] "https://api.nytimes.com/svc/topstories/v2/technology.json?"
## [10] "https://api.nytimes.com/svc/topstories/v2/science.json?"
## [11] "https://api.nytimes.com/svc/topstories/v2/health.json?"
## [12] "https://api.nytimes.com/svc/topstories/v2/sports.json?"
## [13] "https://api.nytimes.com/svc/topstories/v2/arts.json?"
## [14] "https://api.nytimes.com/svc/topstories/v2/books.json?"
## [15] "https://api.nytimes.com/svc/topstories/v2/movies.json?"
## [16] "https://api.nytimes.com/svc/topstories/v2/theater.json?"
## [17] "https://api.nytimes.com/svc/topstories/v2/sundayreview.json?"
## [18] "https://api.nytimes.com/svc/topstories/v2/fashion.json?"
## [19] "https://api.nytimes.com/svc/topstories/v2/tmagazine.json?"
## [20] "https://api.nytimes.com/svc/topstories/v2/food.json?"
## [21] "https://api.nytimes.com/svc/topstories/v2/travel.json?"
## [22] "https://api.nytimes.com/svc/topstories/v2/magazine.json?"
## [23] "https://api.nytimes.com/svc/topstories/v2/realestate.json?"
## [24] "https://api.nytimes.com/svc/topstories/v2/automobiles.json?"
## [25] "https://api.nytimes.com/svc/topstories/v2/obituaries.json?"
## [26] "https://api.nytimes.com/svc/topstories/v2/insider.json?"
This includes all 26 sections in top_stories.
The code below tries to grab all the top stories in these sections, using two lapply calls to chain the GET and content functions together.
nytimes <- urls %>%
  lapply(GET, add_headers("api-key" = Sys.getenv("nytimes_apikey"))) %>%
  lapply(content, "parse")
However, I found that this does not work. It took me some time to figure out why. It turns out that the NY Times API server imposes a rate limit of 1,000 calls per day and 5 calls per second. With lapply, the GET calls were made too quickly and exceeded the 5 calls per second limit, so the first few calls returned results and the rest were rejected.
To resolve this issue, I used a for-loop and added a short pause inside the loop with the Sys.sleep function.
nytimes <- list()
for (i in 1:length(urls)){
  Sys.sleep(0.1)
  nytimes[[i]] <- GET(urls[i], add_headers("api-key" = Sys.getenv("nytimes_apikey")))
}
(num_lst <- length(nytimes))
## [1] 26
This pauses for one tenth of a second before each GET call, so I stay under the rate limit. With the pause in place, all 26 GET calls executed successfully.
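As an extra sanity check (not part of the original run), httr's status_code() can confirm that none of the 26 responses were rejected before I overwrite them with the parsed content:
sapply(nytimes, status_code)  # all values should be 200; a non-200 status (typically 429) would mean the rate limit was hit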
I can now parse each JSON response into a list using the content function.
nytimes <- lapply(nytimes, content, "parse")
The result is a list containing 26 elements. Each element corresponds to one section of the top_stories API and is itself a list.
Exploring The List
Let’s see what the contents of these lists are:
lapply(nytimes, names)
## [[1]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[2]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[3]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[4]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[5]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[6]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[7]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[8]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[9]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[10]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[11]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[12]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[13]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[14]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[15]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[16]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[17]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[18]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[19]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[20]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[21]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[22]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[23]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[24]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[25]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
##
## [[26]]
## [1] "status" "copyright" "section" "last_updated"
## [5] "num_results" "results"
So they all have the same structure. I’m particularly interested in the “results” element. Let’s apply the readNYTimesApi function created above to see the contents of one of the sections.
temp <- nytimes[[1]]$results
temp <- readNYTimesApi(temp)
temp
Yes, this is exactly what I was looking for: a data.frame object containing all the necessary information about the top stories in this section.
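Since the full print-out is long, a couple of quick checks (added here for illustration) summarize its shape:
dim(temp)    # number of stories in this section by number of columns
names(temp)  # column names, drawn from the API's results fields, e.g. title, abstract, url, byline, published_date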
Compile All Top News
I now parse all 26 elements and combine them into one big data.frame object containing the top stories from all 26 sections.
alltopstories <- nytimes %>%
  lapply("[[", "results") %>%   # Grab the "results" element of each section
  lapply(readNYTimesApi) %>%    # Apply the readNYTimesApi function to each element
  do.call(rbind.fill, .)        # Bind the elements together row-wise
write.csv(alltopstories, "alltopstories.csv", row.names = F)
alltopstories
The csv file can be found here: https://raw.githubusercontent.com/Tyllis/Data607/master/alltopstories.csv
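To read the exported file back in, either from the local copy or directly from the GitHub raw URL, something like the following should work:
alltopstories <- read.csv("https://raw.githubusercontent.com/Tyllis/Data607/master/alltopstories.csv", stringsAsFactors = F)  # read the csv straight from the raw GitHub URL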