Data 607 Spring 2019 HW 9

Assignment:

The New York Times web site provides a rich set of APIs, as described here: http://developer.nytimes.com/docs
You’ll need to start by signing up for an API key.
Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it to an R dataframe.

library(httr)
library(jsonlite)
library(tidyr)
library(lubridate)
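
The code below refers to a variable nyt_key that is never defined in this write-up. One common way to supply it without hard-coding the key is an environment variable. This is a minimal sketch under that assumption: the helper get_nyt_key() and the variable name NYT_API_KEY are hypothetical names chosen for illustration, not something the NYT API requires.

```r
# Read the API key from an environment variable.
# "NYT_API_KEY" is an assumed, illustrative name.
get_nyt_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}

# Usage (after setting the variable, e.g. in ~/.Renviron):
# nyt_key <- get_nyt_key()
```

Failing early with a clear message is easier to debug than sending a request with an empty key.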

Initial experimentation with httr

# Use GET to send a request to the NYT Article Search API.
# nyt_key holds the API key (its definition is not shown here).
url_dreamers <- paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=dreamers&api-key=", nyt_key)
dreamer1 <- GET(url_dreamers, accept_json())
# check for an error (TRUE if the status code is 400 or above)
http_error(dreamer1)

## [1] FALSE

# Take a look at what was fetched. The output is not included because it contains the API key.
dreamer1

# Attempt to convert the parsed JSON response (nested lists) into a dataframe by extracting columns with dplyr and base R.
# However, the response has many nested fields and this gets complicated quickly, so look for another way to do this.
details <- content(dreamer1, as="parsed")
# details$response$docs

Found the tutorial below and decided to use jsonlite's fromJSON() instead of httr

Reference: http://www.storybench.org/working-with-the-new-york-times-api-in-r/

# Search for articles on Dreamers, registered undocumented individuals, given the recent political spotlight
dreamers <- fromJSON(url_dreamers, flatten=TRUE) %>% data.frame()

# Take a look at the columns
colnames(dreamers)

##  [1] "status"                               
##  [2] "copyright"                            
##  [3] "response.docs.web_url"                
##  [4] "response.docs.snippet"                
##  [5] "response.docs.lead_paragraph"         
##  [6] "response.docs.abstract"               
##  [7] "response.docs.print_page"             
##  [8] "response.docs.source"                 
##  [9] "response.docs.multimedia"             
## [10] "response.docs.keywords"               
## [11] "response.docs.pub_date"               
## [12] "response.docs.document_type"          
## [13] "response.docs.news_desk"              
## [14] "response.docs.section_name"           
## [15] "response.docs.subsection_name"        
## [16] "response.docs.type_of_material"       
## [17] "response.docs._id"                    
## [18] "response.docs.word_count"             
## [19] "response.docs.uri"                    
## [20] "response.docs.headline.main"          
## [21] "response.docs.headline.kicker"        
## [22] "response.docs.headline.content_kicker"
## [23] "response.docs.headline.print_headline"
## [24] "response.docs.headline.name"          
## [25] "response.docs.headline.seo"           
## [26] "response.docs.headline.sub"           
## [27] "response.docs.byline.original"        
## [28] "response.docs.byline.person"          
## [29] "response.docs.byline.organization"    
## [30] "response.meta.hits"                   
## [31] "response.meta.offset"                 
## [32] "response.meta.time"

# The search returned 10 articles with 32 columns because each page/request has a max of 10 articles
dim(dreamers)

## [1] 10 32
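
To see what flatten=TRUE contributes to the column names above, here is a small offline sketch. The JSON string is a made-up mock shaped loosely like the Article Search response (its values and URLs are invented for illustration), so no API key or network call is needed.

```r
library(jsonlite)

# Mock response shaped loosely like the Article Search JSON (made-up values).
mock_json <- '{
  "status": "OK",
  "response": {
    "docs": [
      {"web_url": "https://example.com/a", "headline": {"main": "A", "kicker": "K1"}},
      {"web_url": "https://example.com/b", "headline": {"main": "B", "kicker": "K2"}}
    ],
    "meta": {"hits": 2, "offset": 0}
  }
}'

nested_docs <- fromJSON(mock_json)$response$docs
flat_docs   <- fromJSON(mock_json, flatten = TRUE)$response$docs

names(nested_docs)  # "headline" is one column holding a nested data frame
names(flat_docs)    # nested fields become "headline.main" and "headline.kicker"
```

Flattening is what makes fields such as headline.main usable as ordinary columns in later steps.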

# Set some parameters to grab all the hits by identifying a date range and max page # to loop through
term <- "dreamers" 
begin_date <- "20190101" # YYYYMMDD
end_date <- "20190331"

# Concatenate pieces of the url for the api call
baseurl <- paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=",term,
                  "&begin_date=",begin_date,"&end_date=",end_date,
                  "&facet_filter=true&api-key=",nyt_key)

# Identify the # of hits to calculate the max pages 
initialQuery <- fromJSON(baseurl)
print(initialQuery$response$meta$hits[1]) # returns the total # of hits

## [1] 83
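
The URL above is built by plain string concatenation, which works for a single word like "dreamers" but would break for a search term containing spaces or reserved characters. A hedged sketch of one alternative follows, using base R's URLencode(). The helper build_search_url() is a hypothetical name introduced here, and "MY_KEY" is a placeholder, not a real key.

```r
# Build an Article Search URL, percent-encoding each query value.
# build_search_url() is a hypothetical helper, not part of any package.
build_search_url <- function(term, begin_date, end_date, key) {
  enc <- function(x) utils::URLencode(x, reserved = TRUE)
  paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json",
         "?q=", enc(term),
         "&begin_date=", enc(begin_date),
         "&end_date=", enc(end_date),
         "&facet_filter=true",
         "&api-key=", enc(key))
}

# A two-word search term: the space is encoded as %20.
example_url <- build_search_url("dream act", "20190101", "20190331", "MY_KEY")
example_url
```

httr's GET() also accepts a query list that handles encoding, which would be another reasonable route.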

maxPages <- ceiling((initialQuery$response$meta$hits[1] / 10) -1) # reduce by 1 because loop starts with page 0
print(maxPages) # 8 is the max page, so starting from 0, a total of 9 pages of results

## [1] 8
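
The page arithmetic above can be wrapped in a small function to check edge cases. This is a sketch: last_page_index() is a hypothetical helper, and the page size of 10 comes from the note above that each request returns at most 10 articles. One case worth guarding is zero hits, where the original expression gives -1 and a loop over 0:-1 would visit pages 0 and -1.

```r
# Index of the last page when pages are numbered from 0.
# last_page_index() is a hypothetical helper for illustration.
last_page_index <- function(hits, per_page = 10) {
  max(ceiling(hits / per_page) - 1, 0)
}

last_page_index(83)  # 8: pages 0 through 8, the last one holding 3 articles
last_page_index(80)  # 7: exactly 8 full pages
last_page_index(0)   # 0: clamped so the loop never asks for page -1
```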

# Loop through all pages to get all the hits
pages <- list()
for(i in 0:maxPages){
  nytSearch <- fromJSON(paste0(baseurl, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages[[i+1]] <- nytSearch 
  Sys.sleep(9) # because there are 3 previous calls in under a min and the api call limit is 10/min 
}
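
The loop above stops entirely if any single request fails. Below is a hedged sketch of a variant that skips failed pages with a warning instead. fetch_pages() and its fetch and pause arguments are hypothetical names introduced for illustration; the default fetch mirrors the fromJSON(...) %>% data.frame() call used above, and passing a different fetch function allows offline testing.

```r
# Fetch pages 0..max_page, skipping any page whose request errors out.
# fetch_pages() is a hypothetical helper, not part of jsonlite or httr.
fetch_pages <- function(baseurl, max_page,
                        fetch = function(u) data.frame(jsonlite::fromJSON(u, flatten = TRUE)),
                        pause = 9) {
  pages <- list()
  for (i in 0:max_page) {
    page_url <- paste0(baseurl, "&page=", i)
    result <- tryCatch(
      fetch(page_url),
      error = function(e) {
        warning("Page ", i, " failed: ", conditionMessage(e), call. = FALSE)
        NULL
      }
    )
    # Append only successful pages so the list has no gaps.
    if (!is.null(result)) pages[[length(pages) + 1]] <- result
    # Pause between requests, but not after the last one.
    if (i < max_page) Sys.sleep(pause)
  }
  pages
}

# Usage with the objects defined above:
# pages <- fetch_pages(baseurl, maxPages)
```

The returned list can be passed to rbind_pages() in the same way as the pages list built by the loop.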

# Save the page results in a dataframe
dreamer_search <- rbind_pages(pages)
# Take a peek at 2 informative columns. Note the noise: the book reviews are likely unrelated to registered undocumented individuals
head(dreamer_search, n=10)[c('response.docs.web_url', 'response.docs.snippet')]

##                                                                                         response.docs.web_url
## 1  https://www.nytimes.com/2019/03/18/arts/music/review-dreamers-oratorio-jimmy-lopez-nilo-cruz-berkeley.html
## 2                         https://www.nytimes.com/2019/02/06/books/review/dreamers-karen-thompson-walker.html
## 3                         https://www.nytimes.com/2019/01/07/books/review-dreamers-karen-thompson-walker.html
## 4                      https://www.nytimes.com/2019/02/07/us/california-today-dreamer-state-of-the-union.html
## 5                                https://www.nytimes.com/2019/01/11/opinion/dreamer-rhodes-scholar-human.html
## 6                                https://www.nytimes.com/2019/01/19/us/politics/trump-proposal-daca-wall.html
## 7                             https://www.nytimes.com/2019/01/22/us/politics/supreme-court-daca-dreamers.html
## 8                           https://www.nytimes.com/2019/01/21/us/politics/democrats-trump-dreamers-deal.html
## 9             https://www.nytimes.com/video/us/politics/100000006315917/trump-shutdown-immigration-video.html
## 10                 https://www.nytimes.com/2019/02/01/theater/paradise-square-musical-berkeley-drabinsky.html
##                                                                                                                                                                                                    response.docs.snippet
## 1                                                                         Jimmy López and Nilo Cruz’s work, about the experiences of immigrants, had its premiere with the Philharmonia Orchestra and Esa-Pekka Salonen.
## 2                                                                           In Karen Thompson Walker’s second novel, people stop waking up in the morning. They’re not dead, just trapped in a dream-filled netherworld.
## 3                                                                        Karen Thompson Walker’s second novel is about a virus that causes people to nod off for very long periods and dream in disastrous premonitions.
## 4                                                                                                           Thursday: We talked to an undocumented Cal State Fullerton student about her trip to the State of the Union.
## 5                                                                                                            A person shouldn’t have to be a “genius” or “economically productive” to have access to equal opportunity. 
## 6                                       President Trump cast the proposal, which included $5.7 billion for a border barrier, as a compromise as he sought to shift pressure to Democrats to end the government shutdown.
## 7                                                                                   The court’s inaction almost certainly means it will not hear the administration’s challenge in its current term, which ends in June.
## 8                                                          Democrats described the proposal, offering temporary protection for young undocumented immigrants in exchange for border wall funding, as a “hostage taking.”
## 9  In a White House address, President Trump announced a plan that would provide temporary protection from deportation for some immigrants in exchange for $5.7 billion in funding for a wall on the U.S.-Mexico border.
## 10                                                                      “Paradise Square” is the most expensive show Berkeley Repertory Theater has ever done. Its creators are prestigious, its major patron notorious.

# tidyverse conflicts with jsonlite (purrr masks jsonlite's flatten()), so load it after the API calls
library(tidyverse)

Make sense of output with some plots

In which types of material do articles on Dreamers tend to appear?

# Visualize coverage of dreamers by type of material
dreamer_search %>% 
  group_by(response.docs.type_of_material) %>%
  summarize(count=n()) %>%
  mutate(percent = (count / sum(count))*100) %>%
  ggplot() +
  geom_bar(aes(y=percent, x=response.docs.type_of_material, fill=response.docs.type_of_material), stat = "identity") + coord_flip()

Articles on Dreamers are concentrated in News material, followed by Op-Ed material. From an initial glance at the results, the Review material is likely not about the Dreamers who are registered undocumented individuals living in the U.S.

In which sections do Dreamer articles tend to appear?

# Visualize coverage of dreamers by section
dreamer_search %>% 
  group_by(response.docs.section_name) %>%
  summarize(count=n()) %>%
  mutate(percent = (count / sum(count))*100) %>%
  ggplot() +
  geom_bar(aes(y=percent, x=response.docs.section_name, fill=response.docs.section_name), stat = "identity") + coord_flip()

We see that close to a majority of the articles on Dreamers are in the U.S. section. 14% of articles are in the Opinion section, followed by 8% in Books and 7% in Arts. The latter two sections may be noise.
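
If the Books and Arts sections are treated as noise, they could be dropped before plotting. The sketch below uses base R on a small made-up data frame that borrows the column names of dreamer_search. The choice of which sections count as noise is an assumption based on the observation above, not a verified classification.

```r
# Mock rows using the same column names as dreamer_search (values are made up).
mock_search <- data.frame(
  response.docs.section_name = c("U.S.", "Books", "Opinion", "Arts", "U.S."),
  response.docs.web_url = paste0("https://example.com/", 1:5),
  stringsAsFactors = FALSE
)

# Sections assumed to be noise, based on the observation above.
noise_sections <- c("Books", "Arts")

# Keep only rows whose section is not in the noise list.
filtered <- mock_search[!(mock_search$response.docs.section_name %in% noise_sections), ]
nrow(filtered)  # 3
```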

When were Dreamer articles most frequent in the last 3 months?

# plot trends in how frequently dreamers are mentioned in the last 3 months
dreamer_search %>%
  mutate(pubDay=gsub("T.*","",response.docs.pub_date)) %>%
  group_by(pubDay) %>%
  summarise(count=n()) %>%
  mutate(date=ymd(pubDay)) %>% # parse to Date so the axis is ordered and spaced chronologically
  arrange(date) %>%
  #filter(count >= 2) %>%
  ggplot() +
  geom_bar(aes(x=date, y=count), stat="identity") + coord_flip()

We see spikes shortly after mid-January, suggesting that events involving Dreamers occurred earlier in the month. Since then, there have been fewer articles but still steady coverage in February and March. One possible explanation for the spike is that it follows a series of political events, starting with Pres. Trump’s demand for funding for a border wall. He threatened a government shutdown if his demands were not met. In response, some Democrats fought for protections for Dreamers and DACA recipients. The trend above may therefore reflect the fallout from these political events.