Load libraries

library(jsonlite)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()  masks stats::filter()
## x purrr::flatten() masks jsonlite::flatten()
## x dplyr::lag()     masks stats::lag()

library(stringr)

Connecting to API

Key and secret provided from NYTimes API site: “https://developer.nytimes.com/”

key <- "APHySXOpNOGSc10kLBVm8ZdVljlRIRlf"
secret <- "2JYUVWVz9f6QGrkS"

With the search api, we can list key-words to filter for, and also filter by time. For this assignment, we will look for articles related to “data-science”, published anytime between the start of our semester, to today.

term <- "data-science"
begin <- "20210816"
end <- "20211024"

We create a “base_url” with all of the required parameters for API querying using stringr::str_c

base_url <- str_c("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=", term, "&begin_date=", begin, "&end_date=", end, "&facet_filter=true&api-key=", key, sep="")

Extracting data into dataframe

First we will perform an initial query which will return a set of documents. Part of the information we will receive contains the “maxPages” from the search. Taking this number, we will iterate through values 1:maxPages and extract all documents for each page in the page range. All of the results will be flattened into a dataframe, and then loaded into a list.

initialQuery <- fromJSON(base_url)

#maxPages <- round((initialQuery$response$meta$hits[1] / 10)-1) 

pages_2021 <- vector("list",length=15)

for(i in 0:15){
    nytSearch <- fromJSON(paste0(base_url, "&page=", i), flatten = TRUE) %>% data.frame()
    pages_2021[[i+1]] <- nytSearch
    Sys.sleep(5)
}

Now that we have all of our dataframes in a list, we can use a simple r_bind call to aggregate them all into the same dataframe

nyt_2021_articles <- rbind_pages(pages_2021)

Looking below, we have successfully extracted and loaded over 200 NYTimes articles into a dataframe.

head(nyt_2021_articles)

Conclusion

Based on the above, it is very straightforward to extract data from the NYTimes search API. In future exploration, I will further examine some of the nested columns that are present (like keywords and persons).

607_homework_9

Alec

10/24/2021

Load libraries

Connecting to API

Extracting data into dataframe

Conclusion