This week, we were tasked with choosing a New York Times API, constructing an interface in R to read in the JSON data, and transforming it into an R data frame.
I saved my key in an external file and loaded it into R.
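library(jsonlite)  # fromJSON()
library(tidyverse) # as_tibble(), glimpse()
library(httr2)     # request(), req_perform()
source("api_credentials.R") # expected to define api_key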
I used the fromJSON function from the jsonlite package to call the API. It takes the request URL, which includes the method (the count used to rank articles: emailed, shared, or viewed), the period in days, and your unique API key.
domain <- "https://api.nytimes.com"
path <- "svc/mostpopular/v2/"
method <- "viewed" # count to use (emailed, shared, or viewed)
period <- 7 # number of days
json <- fromJSON(paste(domain, "/", path, method, "/", period,".json?api-key=",api_key,sep=""))
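The URL construction above can be wrapped in a small helper so that the method and period are validated before a request is sent. This is only a sketch: `build_most_popular_url` is a hypothetical name, and the allowed periods (1, 7, or 30 days) are an assumption based on my reading of the API documentation.

```r
# Hypothetical helper (not part of the original code): build a Most Popular API URL.
build_most_popular_url <- function(method = c("viewed", "shared", "emailed"),
                                   period = 7,
                                   api_key) {
  # Reject anything other than the three methods named in the text
  method <- match.arg(method)
  # Assumption: the API only accepts periods of 1, 7, or 30 days
  stopifnot(period %in% c(1, 7, 30))
  paste0("https://api.nytimes.com/svc/mostpopular/v2/",
         method, "/", period, ".json?api-key=", api_key)
}

# Example with a placeholder key; no request is sent here.
example_url <- build_most_popular_url("viewed", 7, api_key = "YOUR_KEY")
print(example_url)
```

The resulting string can be passed to fromJSON in the same way as the pasted URL above.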
fromJSON parses the response into a list. Its results element holds one row per article, which I converted to a tibble with the as_tibble function from the tibble package, part of the tidyverse.
most_popular_articles <- json$results |>
as_tibble()
glimpse(most_popular_articles)
## Rows: 20
## Columns: 22
## $ uri <chr> "nyt://article/94d9dddc-38ee-5af8-bb1a-a27fb191b994", "…
## $ url <chr> "https://www.nytimes.com/2024/10/28/us/politics/trump-m…
## $ id <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ asset_id <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ source <chr> "New York Times", "New York Times", "New York Times", "…
## $ published_date <chr> "2024-10-28", "2024-11-02", "2024-10-31", "2024-08-14",…
## $ updated <chr> "2024-10-29 22:00:34", "2024-11-02 14:49:55", "2024-11-…
## $ section <chr> "U.S.", "U.S.", "Opinion", "U.S.", "Opinion", "Style", …
## $ subsection <chr> "Politics", "Politics", "", "2024 Elections", "", "", "…
## $ nytdsection <chr> "u.s.", "u.s.", "opinion", "u.s.", "opinion", "style", …
## $ adx_keywords <chr> "Presidential Election of 2024;Discrimination;Comedy an…
## $ column <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ byline <chr> "By Maggie Haberman, Jonathan Swan and Michael Gold", "…
## $ type <chr> "Article", "Article", "Article", "Interactive", "Articl…
## $ title <chr> "Trump Team Fears Damage From Racist Rally Remarks", "T…
## $ abstract <chr> "The Trump campaign issued a rare statement distancing …
## $ des_facet <list> <"Presidential Election of 2024", "Discrimination", "C…
## $ org_facet <list> <>, <>, <>, <>, "Republican Party", <"Republican Natio…
## $ per_facet <list> <"Trump, Donald J", "Hinchcliffe, Tony (1984- )">, "Tr…
## $ geo_facet <list> "Puerto Rico", "Milwaukee (Wis)", <>, <>, <>, "Florida…
## $ media <list> [<data.frame[1 x 6]>], [<data.frame[1 x 6]>], [<data.f…
## $ eta_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
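The glimpse output shows that dates arrive as character strings and that the facet columns are list columns. As a rough sketch of how such columns could be tidied, the example below uses a made-up two-row tibble (`toy_articles`, hypothetical data) with the same column types, so it runs without an API key.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the API results: made-up rows, same column types as above
toy_articles <- tibble(
  title = c("Article A", "Article B"),
  published_date = c("2024-10-28", "2024-11-02"),
  des_facet = list(c("Topic 1", "Topic 2"), character(0))
)

tidy_articles <- toy_articles |>
  mutate(
    published_date = as.Date(published_date),  # character -> Date
    n_topics = lengths(des_facet),             # number of entries per list element
    # collapse each list element into a single string for display
    topics = vapply(des_facet, paste, character(1), collapse = "; ")
  )

glimpse(tidy_articles)
```

The same mutate call should apply to most_popular_articles, since its published_date and des_facet columns have the types shown in the output above.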
I also used the httr2 package, this time to retrieve the articles shared most often over the last 30 days. I first initialized a request with the domain name, then added a request header asking for a JSON response and set the URL path. I used req_dry_run to confirm that I had created the right GET request.
method2 <- "shared" # from facebook
period2 <- 30 # num of days
req <- request(domain) |>
req_headers("Accept" = "application/json") |>
req_url_path(path = paste("/", path, method2, "/", period2,".json?api-key=",api_key,sep=""))
req |> req_dry_run()
## GET /svc/mostpopular/v2/shared/30.json?api-key=<redacted> HTTP/1.1
## Host: api.nytimes.com
## User-Agent: httr2/1.0.5 r-curl/5.2.3 libcurl/8.7.1
## Accept-Encoding: gzip
## Accept: application/json
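As an alternative to pasting the key into the path, httr2 can attach query parameters separately with req_url_query. The sketch below uses a placeholder key (`api_key_demo`, hypothetical) and only builds the request, so nothing is sent.

```r
library(httr2)

api_key_demo <- "YOUR_KEY"  # placeholder, not a real key

req_alt <- request("https://api.nytimes.com") |>
  req_url_path("/svc/mostpopular/v2/shared/30.json") |>  # path only, no query string
  req_url_query(`api-key` = api_key_demo) |>             # key added as a query parameter
  req_headers(Accept = "application/json")

# Inspect the assembled URL without performing the request
print(req_alt$url)
```

Keeping the path and the query apart makes it easier to add or change parameters later without rebuilding the whole string.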
I used req_perform to send the request (req) built above. I passed the response to resp_body_string to extract the body as a JSON string, which could then be passed to jsonlite's fromJSON to return a list.
resp <- req_perform(req)
json_resp <- resp |>
resp_body_string() |>
fromJSON()
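httr2 also provides resp_body_json, which parses the body in one step. The sketch below builds a fake response locally with httr2's response() constructor so that it runs without a network call. The JSON body is made up and only mimics the shape of the real payload.

```r
library(httr2)

# Fake response for illustration only: two made-up articles
fake_resp <- response(
  status_code = 200,
  headers = list(`Content-Type` = "application/json"),
  body = charToRaw(
    '{"status":"OK","results":[{"title":"A","section":"U.S."},{"title":"B","section":"Opinion"}]}'
  )
)

# simplifyVector = TRUE is passed on to jsonlite, so the array of
# article objects is simplified into a data frame
parsed <- fake_resp |> resp_body_json(simplifyVector = TRUE)
str(parsed$results)
```

With a real response, `resp |> resp_body_json(simplifyVector = TRUE)` should be equivalent to the resp_body_string and fromJSON pair above.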
Next, I used as.data.frame to convert the results element of the parsed list into a data frame.
#convert to dataframe
top_articles <- as.data.frame(json_resp$results)
glimpse(top_articles)
## Rows: 20
## Columns: 22
## $ uri <chr> "nyt://article/5a824b76-4f83-5577-a10e-e69e1334a535", "…
## $ url <chr> "https://www.nytimes.com/2024/10/14/us/politics/trump-t…
## $ id <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ asset_id <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ source <chr> "New York Times", "New York Times", "New York Times", "…
## $ published_date <chr> "2024-10-14", "2024-09-30", "2024-10-19", "2024-10-30",…
## $ updated <chr> "2024-10-16 00:18:18", "2024-10-27 11:11:09", "2024-10-…
## $ section <chr> "U.S.", "Opinion", "U.S.", "U.S.", "Opinion", "Opinion"…
## $ subsection <chr> "Politics", "Editorials", "Politics", "Politics", "", "…
## $ nytdsection <chr> "u.s.", "opinion", "u.s.", "u.s.", "opinion", "opinion"…
## $ adx_keywords <chr> "Presidential Election of 2024;Trump, Donald J;Pennsylv…
## $ column <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ byline <chr> "By Michael Gold", "By The Editorial Board", "By Michae…
## $ type <chr> "Article", "Article", "Article", "Interactive", "Articl…
## $ title <chr> "Trump Bobs His Head to Music for 30 Minutes in Odd Tow…
## $ abstract <chr> "After multiple interruptions, Donald Trump cut off que…
## $ des_facet <list> "Presidential Election of 2024", <"United States Polit…
## $ org_facet <list> <>, <>, <>, <>, <"Republican Party", "Democratic Party…
## $ per_facet <list> "Trump, Donald J", <"Harris, Kamala D", "Trump, Donald…
## $ geo_facet <list> "Pennsylvania", <>, "Latrobe (Pa)", <>, <>, <>, <>, <"…
## $ media <list> [<data.frame[0 x 0]>], [<data.frame[1 x 6]>], [<data.f…
## $ eta_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
I used two methods for retrieving our dataset from an API: 1. jsonlite 2. httr2
For this particular API, the jsonlite package proved very easy to use. However, the httr2 package has a lot of additional functionality that goes beyond the scope of this assignment and makes it a more powerful tool. In particular, we can adjust the request to send data to the API and update our core database. Additionally, it allows us to receive and handle data in various formats, including HTML and XML. Lastly, it can be configured to check the HTTP status, allowing us to identify 404 and 500 errors when an API call fails. This seems especially useful for catching errors when breaking up a request into several JSON dumps through pagination and other filtering methods.
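As a rough illustration of the status handling mentioned above, the sketch below wraps req_perform in tryCatch. `fetch_most_popular` is a hypothetical helper, not something from the assignment. It relies on httr2's default behavior of turning 4xx and 5xx responses into R errors of class httr2_http.

```r
library(httr2)

# Hypothetical wrapper: returns the parsed results, or NULL if the API
# answers with an HTTP error status such as 404 or 500.
fetch_most_popular <- function(req) {
  tryCatch(
    {
      resp <- req |>
        req_retry(max_tries = 3) |>  # retry transient failures (429, 503 by default)
        req_perform()                # raises an R error on 4xx/5xx responses
      resp_body_json(resp, simplifyVector = TRUE)$results
    },
    httr2_http = function(cnd) {
      # Report the failure and return NULL instead of stopping the script
      message("Request failed: ", conditionMessage(cnd))
      NULL
    }
  )
}
```

Calling `fetch_most_popular(req)` with the request built earlier would return the results data frame on success. In a paginated loop, a NULL result could be used to skip or log the failed page.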