DATA607 Week 9 Assignment

Starting Up

I used the following packages for this assignment:

library(httr)
library(tidyr)
library(plyr)

I applied for and obtained a NY Times API key. I saved the key in the system environment with the Sys.setenv() function, so it can be retrieved with Sys.getenv("nytimes_apikey"). For this assignment, I am interested in the top_stories API. My goal is to collect the data from every section in top_stories through the API and combine it into a single data.frame object.
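As a minimal sketch of that round trip, the snippet below sets the environment variable and reads it back. The value "my-placeholder-key" is a made-up stand-in, not a real key. In practice the variable could instead be defined in a .Renviron file, which R reads at startup, so the key never appears in the script.

```r
# Store the key for the current R session (placeholder value, not a real key).
Sys.setenv(nytimes_apikey = "my-placeholder-key")

# Retrieve it later without hard-coding it in the analysis code.
api_key <- Sys.getenv("nytimes_apikey")

# Sys.getenv() returns "" for an unset variable, so fail early if it is missing.
if (!nzchar(api_key)) stop("nytimes_apikey is not set")
```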

Wrapping Function for JSON List

Before I started grabbing data from the NY Times API, I first built a function that can help me process the JSON files. From the last assignment, I noticed there are some common steps that can be wrapped in one function.

The function below takes a JSON list object and transforms it into a data.frame object. This will be handy whenever I need to process a JSON response.

readNYTimesApi <- function(jsonlist){
  jsonlist %>%
    lapply(unlist) %>%       # Flatten each nested record into a named vector
    lapply(t) %>%            # Transpose each vector into a one-row matrix
    lapply(data.frame, stringsAsFactors = FALSE) %>%  # Turn each row into a data.frame
    do.call(rbind.fill, .)   # Bind rows together, filling missing columns with NA
}
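To make the transformation concrete, here is a small sketch on made-up data. The helper name flattenRecords and the toy records are hypothetical; the helper follows the same four steps as the function above, written without the pipe so the block stands on its own. It assumes the plyr package is installed.

```r
library(plyr)  # provides rbind.fill()

# Hypothetical helper that follows the same steps, written without the pipe.
flattenRecords <- function(jsonlist) {
  flat <- lapply(jsonlist, unlist)   # nested list -> named character vector
  rows <- lapply(flat, t)            # named vector -> one-row matrix
  dfs  <- lapply(rows, data.frame, stringsAsFactors = FALSE)
  do.call(rbind.fill, dfs)           # columns missing from a record become NA
}

# Toy input (made up for illustration): the second record has no "byline".
toy <- list(
  list(title = "Story A", section = "World", byline = list(name = "Reporter X")),
  list(title = "Story B", section = "Arts")
)

toy_df <- flattenRecords(toy)
dim(toy_df)           # rows and columns of the combined data.frame
toy_df$byline.name    # nested field flattened by unlist(); NA where absent
```

The point of using rbind.fill rather than rbind is that records with different sets of fields can still be stacked, with the gaps filled by NA.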

Top Stories

The API documentation for the NY Times top_stories is here: http://developer.nytimes.com/top_stories_v2.json#/Documentation/GET/%7Bsection%7D.%7Bformat%7D.

The documentation explains that the URL for the GET function should be in this format: /{section}.{format}, where “format” should be json and “section” can be any of the following: “home, opinion, world, national, politics, upshot, nyregion, business, technology, science, health, sports, arts, books, movies, theater, sundayreview, fashion, tmagazine, food, travel, magazine, realestate, automobiles, obituaries, insider”.

sections <- "home, opinion, world, national, politics, upshot, nyregion, business, technology, science, health, sports, arts, books, movies, theater, sundayreview, fashion, tmagazine, food, travel, magazine, realestate, automobiles, obituaries, insider" %>% 
  strsplit(split=", ") %>% 
  unlist()

The code below constructs a vector containing the request URLs for all of the sections.

topstories <- "https://api.nytimes.com/svc/topstories/v2/"
urls <- paste(topstories, sections, ".json?",sep="")
urls

##  [1] "https://api.nytimes.com/svc/topstories/v2/home.json?"        
##  [2] "https://api.nytimes.com/svc/topstories/v2/opinion.json?"     
##  [3] "https://api.nytimes.com/svc/topstories/v2/world.json?"       
##  [4] "https://api.nytimes.com/svc/topstories/v2/national.json?"    
##  [5] "https://api.nytimes.com/svc/topstories/v2/politics.json?"    
##  [6] "https://api.nytimes.com/svc/topstories/v2/upshot.json?"      
##  [7] "https://api.nytimes.com/svc/topstories/v2/nyregion.json?"    
##  [8] "https://api.nytimes.com/svc/topstories/v2/business.json?"    
##  [9] "https://api.nytimes.com/svc/topstories/v2/technology.json?"  
## [10] "https://api.nytimes.com/svc/topstories/v2/science.json?"     
## [11] "https://api.nytimes.com/svc/topstories/v2/health.json?"      
## [12] "https://api.nytimes.com/svc/topstories/v2/sports.json?"      
## [13] "https://api.nytimes.com/svc/topstories/v2/arts.json?"        
## [14] "https://api.nytimes.com/svc/topstories/v2/books.json?"       
## [15] "https://api.nytimes.com/svc/topstories/v2/movies.json?"      
## [16] "https://api.nytimes.com/svc/topstories/v2/theater.json?"     
## [17] "https://api.nytimes.com/svc/topstories/v2/sundayreview.json?"
## [18] "https://api.nytimes.com/svc/topstories/v2/fashion.json?"     
## [19] "https://api.nytimes.com/svc/topstories/v2/tmagazine.json?"   
## [20] "https://api.nytimes.com/svc/topstories/v2/food.json?"        
## [21] "https://api.nytimes.com/svc/topstories/v2/travel.json?"      
## [22] "https://api.nytimes.com/svc/topstories/v2/magazine.json?"    
## [23] "https://api.nytimes.com/svc/topstories/v2/realestate.json?"  
## [24] "https://api.nytimes.com/svc/topstories/v2/automobiles.json?" 
## [25] "https://api.nytimes.com/svc/topstories/v2/obituaries.json?"  
## [26] "https://api.nytimes.com/svc/topstories/v2/insider.json?"

This includes all 26 sections in top_stories.

The code below tries to grab all the top stories in these sections, using two lapply calls to pipe the GET and content functions together.

nytimes <- urls %>% 
  lapply(GET, add_headers("api-key"=Sys.getenv("nytimes_apikey"))) %>% 
  lapply(content, as = "parsed")

However, I found that this does not work, and it took me some time to find out why. It turns out that the NY Times API server imposes a rate limit of 1,000 calls per day and 5 calls per second. With lapply, the GET calls were issued too quickly and exceeded the 5-calls-per-second limit, so the first few calls returned results and the rest were rejected.

To resolve this issue, I used a for-loop and added a pause inside the loop with the Sys.sleep function.

nytimes <- list()
for (i in seq_along(urls)){
  Sys.sleep(0.1)
  nytimes[[i]] <- GET(urls[i], add_headers("api-key"=Sys.getenv("nytimes_apikey")))
}
(num_lst <- length(nytimes))

## [1] 26

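This pauses for 1/10 of a second before each GET call. With that delay in place, I was able to execute all 26 GET calls successfully. Note that a 0.1-second pause alone would still permit up to 10 calls per second; it presumably worked because each request also takes time to complete, and a pause of 0.2 seconds or more would be the safer choice. I can now parse the JSON responses into lists using the content function.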
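The same idea can be packaged as a reusable helper. The sketch below is an illustration, not the code used for this assignment: throttledLapply and fake_get are hypothetical names, and fake_get stands in for GET so the example runs without network access. The helper spaces the start of consecutive calls at least min_interval seconds apart.

```r
# Hypothetical helper: call f on each element of xs, waiting so that
# consecutive calls start at least min_interval seconds apart.
throttledLapply <- function(xs, f, min_interval = 0.2, ...) {
  out <- vector("list", length(xs))
  last_start <- -Inf                       # no previous call yet
  for (i in seq_along(xs)) {
    wait <- min_interval - (as.numeric(Sys.time()) - last_start)
    if (wait > 0) Sys.sleep(wait)          # sleep only for the remaining time
    last_start <- as.numeric(Sys.time())
    out[[i]] <- f(xs[[i]], ...)
  }
  out
}

# Stand-in for GET() so the sketch runs offline.
fake_get <- function(url) paste("fetched", url)

started <- Sys.time()
res <- throttledLapply(c("a", "b", "c"), fake_get, min_interval = 0.2)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
elapsed   # three calls need two waits of about 0.2 seconds each
```

With httr loaded, the same helper could in principle be called as throttledLapply(urls, GET, add_headers("api-key" = Sys.getenv("nytimes_apikey"))). An interval of 0.2 seconds corresponds to the 5-calls-per-second limit; a slightly larger value leaves some margin.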

nytimes <- lapply(nytimes, content, as = "parsed")

The result is a list containing 26 elements. Each element corresponds to a section in the top_stories API and is a list itself.

Exploring The List

Let’s see what these lists contain:

lapply(nytimes, names)

## [[1]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[2]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[3]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[4]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[5]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[6]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[7]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[8]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[9]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[10]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[11]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[12]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[13]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[14]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[15]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[16]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[17]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[18]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[19]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[20]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[21]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[22]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[23]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[24]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[25]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"     
## 
## [[26]]
## [1] "status"       "copyright"    "section"      "last_updated"
## [5] "num_results"  "results"

So they all have the same structure. I’m particularly interested in the “results” element. Let’s apply the readNYTimesApi function created above to see the contents for one of the sections.

temp <- nytimes[[1]]$results
temp <- readNYTimesApi(temp)
temp

Yes. This is exactly what I was looking for. It is a data.frame object containing all necessary information for the top stories in this section.

Compile All Top News

I now parse all 26 elements and construct one big data.frame object, which contains the top news from all 26 sections.

alltopstories <- nytimes %>% 
  lapply("[[", "results") %>%  # Grab the "results" element of each element
  lapply(readNYTimesApi) %>%   # Apply the readNYTimesApi function to each element
  do.call(rbind.fill, .)       # Bind elements together row-wise
write.csv(alltopstories, "alltopstories.csv", row.names = FALSE)
alltopstories
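One way to sanity-check the combined result is to compare the num_results field that each response reports with the number of records actually present in its results element. The sketch below does this on a made-up stand-in for the parsed list (toy_parsed is hypothetical, and real elements carry more fields); the total should match the number of rows in the combined data.frame, since each record becomes one row.

```r
# Toy stand-in for the parsed list (made up; real elements have more fields).
toy_parsed <- list(
  list(section = "home", num_results = 2L,
       results = list(list(title = "A"), list(title = "B"))),
  list(section = "arts", num_results = 1L,
       results = list(list(title = "C")))
)

# Count reported by the API versus records actually present, per section.
reported <- vapply(toy_parsed, function(x) as.integer(x$num_results), integer(1))
actual   <- vapply(toy_parsed, function(x) length(x$results), integer(1))

check <- data.frame(
  section  = vapply(toy_parsed, function(x) x$section, character(1)),
  reported = reported,
  actual   = actual
)
check

# Expected number of rows after binding all sections together.
total_rows_expected <- sum(actual)
```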

The csv file can be found here: https://raw.githubusercontent.com/Tyllis/Data607/master/alltopstories.csv