1. Introduction

The New York Times offers APIs to get news articles programmatically. Here, I demonstrate the use of the Top Stories API to retrieve articles that are on the homepage of the newspaper and convert the data to a dataframe.


2. Data

I created an account to get an API key. Because RPubs is public, the key is not shown here; instead it is entered at a prompt and appended to the request URL as the api-key query parameter.

api_key <- rstudioapi::askForPassword("API Key")
url <- paste0("https://api.nytimes.com/svc/topstories/v2/home.json?api-key=", api_key)
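
For non-interactive runs, the same URL could be assembled by a small helper that takes the key from an environment variable instead of a prompt. This is only a sketch: the function name build_top_stories_url(), the environment variable NYT_API_KEY, and the section argument (which assumes other sections follow the same URL pattern as "home") are illustrative assumptions, not part of the original workflow.

```r
# Hypothetical helper (not part of the original analysis): builds the
# Top Stories request URL from an API key and a section name.
build_top_stories_url <- function(api_key, section = "home") {
  # Fail early on a missing or empty key instead of sending a bad request
  stopifnot(is.character(api_key), length(api_key) == 1, nzchar(api_key))
  paste0(
    "https://api.nytimes.com/svc/topstories/v2/", section,
    ".json?api-key=", utils::URLencode(api_key, reserved = TRUE)
  )
}

# NYT_API_KEY is an assumed variable name; nothing is built if it is not set
env_key <- Sys.getenv("NYT_API_KEY")
if (nzchar(env_key)) {
  url_from_env <- build_top_stories_url(env_key)
}
```

Keeping the key in an environment variable has the same benefit as the prompt: it never appears in the published document.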

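The API returns data in JSON format, so I parsed it with fromJSON() from the jsonlite package. The flatten = TRUE argument flattens nested data frames into a single data frame with non-nested columns.

library(jsonlite)  # provides fromJSON()

top_stories <- fromJSON(url, flatten = TRUE)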

Note that the retrieved data may change from run to run as the top stories are likely updated frequently.
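
To show how fromJSON() maps JSON onto R objects without calling the API, here is a minimal sketch on a hand-written string. The JSON below is made up for illustration and contains only a few of the fields in the real response.

```r
library(jsonlite)

# Made-up JSON that mimics a small part of the response structure
sample_json <- '{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "Story A", "section": "world", "des_facet": ["Terrorism", "Politics"]},
    {"title": "Story B", "section": "us", "des_facet": []}
  ]
}'

parsed <- fromJSON(sample_json, flatten = TRUE)

# The array of story objects becomes a data frame with one row per story
class(parsed$results)
nrow(parsed$results)

# Arrays of differing lengths inside each story stay as a list column
is.list(parsed$results$des_facet)
```

This is why the “results” element can be used as a dataframe directly, and why some of its columns still hold lists.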


3. Data checks and transformations

The structure of the JSON data can be seen in a data viewer.

View(top_stories)

The top stories are in level 1, element 6 (“results”), which is already a dataframe, so I started the data transformations from there.
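
As an alternative to the data viewer, the top level of the list can be inspected in the console, and the element can be selected by name rather than by position. The sketch below uses a mock list; its element names are assumed to mirror the six top-level fields of the response, and its values are placeholders.

```r
# Mock of the parsed response (placeholder values, assumed field names)
mock_response <- list(
  status = "OK",
  copyright = "Copyright notice",
  section = "home",
  last_updated = "2024-03-23T12:30:00-04:00",
  num_results = 1L,
  results = data.frame(title = "Story A", stringsAsFactors = FALSE)
)

# Show only the first level of the structure
str(mock_response, max.level = 1)

# Selecting by name gives the same object as selecting by position,
# but keeps working if the order of the elements ever changes
by_position <- mock_response[[6]]
by_name <- mock_response$results
identical(by_position, by_name)
```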

3.1. Convert “facet” columns

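Four of the “_facet” columns (des_facet, org_facet, per_facet, and geo_facet) are list columns of character vectors, so I unlisted each element and collapsed it into a single comma-separated string.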

library(dplyr)

# "results" is element 6 of the parsed response
top_stories <- top_stories$results %>%
  rowwise() %>%
  mutate(
    des_facet = toString(unlist(des_facet)),
    org_facet = toString(unlist(org_facet)),
    per_facet = toString(unlist(per_facet)),
    geo_facet = toString(unlist(geo_facet))
  ) %>%
  ungroup()
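
The same collapse can be sketched in base R on a toy list column, which makes the behavior of toString(unlist()) easier to see. The vapply() version below is an alternative to the rowwise() approach used above, not the code that produced the results in this document, and the sample values are made up.

```r
# Toy list column: one element per story, each holding zero or more terms
facet <- list(
  c("Terrorism", "Politics"),
  list(),                      # a story with no terms
  "Russia"
)

# Collapse each element into one comma-separated string;
# vapply() guarantees exactly one string per element
flat <- vapply(facet, function(v) toString(unlist(v)), character(1))
flat
```

An element with no terms becomes an empty string, which is why empty strings show up in these columns and are handled later in section 3.5.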

Now the dataframe looks like this:

glimpse(top_stories)
## Rows: 27
## Columns: 19
## $ section             <chr> "world", "world", "world", "us", "us", "world", "w…
## $ subsection          <chr> "europe", "europe", "europe", "politics", "", "eur…
## $ title               <chr> "In First Remarks on Attack, Putin Tries to Link A…
## $ abstract            <chr> "American officials, who have assessed that a bran…
## $ url                 <chr> "https://www.nytimes.com/2024/03/23/world/europe/m…
## $ uri                 <chr> "nyt://article/bd065841-6d43-573d-9890-07e054d0d35…
## $ byline              <chr> "By Anton Troianovski", "By Julian E. Barnes and E…
## $ item_type           <chr> "Article", "Article", "Article", "Article", "Artic…
## $ updated_date        <chr> "2024-03-23T12:25:20-04:00", "2024-03-23T12:28:04-…
## $ created_date        <chr> "2024-03-23T09:58:43-04:00", "2024-03-22T18:36:11-…
## $ published_date      <chr> "2024-03-23T09:58:43-04:00", "2024-03-22T18:36:11-…
## $ material_type_facet <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ kicker              <chr> "", "", "", "", "news analysis", "", "", "", "", "…
## $ des_facet           <chr> "Terrorism", "Moscow Concert Hall Shooting (March …
## $ org_facet           <chr> "", "Islamic State Khorasan", "Islamic State Khora…
## $ per_facet           <chr> "Putin, Vladimir V", "", "", "Johnson, Mike (1972-…
## $ geo_facet           <chr> "Moscow (Russia), Russia", "Moscow (Russia)", "Rus…
## $ multimedia          <list> [<data.frame[3 x 8]>], [<data.frame[3 x 8]>], [<d…
## $ short_url           <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""…

3.2. Convert “date” columns

The date-times in the “date” columns appear to be in ISO 8601 format, as described in R for Data Science, 2nd edition, chapter 17: the date and time components are separated by a “T”. All of the times also end with “-04:00”, which is the offset of Eastern Daylight Time from UTC (Coordinated Universal Time). [1]

Converting these columns to date-times is therefore straightforward with as_datetime() from lubridate. The function applies the offset and returns the times in UTC, so “12:25:20-04:00” becomes 16:25:20:

library(lubridate)  # provides as_datetime()

top_stories <- top_stories %>%
  mutate(
    updated_date = as_datetime(updated_date),
    created_date = as_datetime(created_date),
    published_date = as_datetime(published_date)
  )
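
The effect of the offset can be checked on a single value. This sketch parses one timestamp from the data and then displays it in Eastern time with with_tz(); the time zone name "America/New_York" is my choice for illustration and is not used elsewhere in this document.

```r
library(lubridate)

# One ISO 8601 timestamp with a UTC offset, taken from the updated_date column
stamp <- "2024-03-23T12:25:20-04:00"

# as_datetime() applies the offset and stores the instant in UTC
parsed_utc <- as_datetime(stamp)
format(parsed_utc, "%Y-%m-%d %H:%M:%S")

# with_tz() changes only how the same instant is displayed
parsed_eastern <- with_tz(parsed_utc, "America/New_York")
format(parsed_eastern, "%Y-%m-%d %H:%M:%S")
```

Both objects represent the same moment, so comparisons and differences between date-times are unaffected by the display time zone.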

3.3. Rename columns

I renamed the following columns for clarity:

  • The date-time columns now have “datetime” in their names.

  • “url” (Uniform Resource Locator, the website address) and “uri” (Uniform Resource Identifier) have longer names, because the lowercase acronyms are hard to tell apart.

  • The “facet” columns have more descriptive names. [2]

top_stories <- top_stories %>%
  rename(
    web_url = url,
    resource_identifier = uri,
    updated_datetime = updated_date,
    created_datetime = created_date,
    published_datetime = published_date,
    terms_descriptive_subject = des_facet,
    terms_organization = org_facet,
    terms_person = per_facet,
    terms_geographic_area = geo_facet
  )

3.4. Unnest and unpack “multimedia” column

I debated how to handle the “multimedia” column, which is a list of dataframes. According to tidy data principles, it should be unnested into longer form, but the resulting dataframe is less readable because each row (top story) is repeated once for every multimedia item. If the multimedia data isn’t needed for an analysis, it might be better to leave the column nested so that the dataframe stays compact.

However, for the purpose of this assignment, I went ahead and unnested it. This led me to discover that unpack() is also needed, because unnest_longer() turns a list of dataframes into a single dataframe column (a “2D column”) rather than into separate columns. [3] The keep_empty = TRUE argument prevents rows with null elements from being dropped.

library(tidyr)  # provides unnest_longer() and unpack()

top_stories <- top_stories %>%
  unnest_longer(multimedia, keep_empty = TRUE) %>%
  unpack(cols = multimedia, names_sep = "_")
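
The two steps are easier to follow on a toy example. The sketch below uses a made-up two-story tibble in which the second story has no multimedia; the column values are illustrative only.

```r
library(tibble)
library(tidyr)

# Toy data: a list column holding a data frame for story A and NULL for story B
stories <- tibble(
  title = c("Story A", "Story B"),
  multimedia = list(
    data.frame(
      format = c("Super Jumbo", "Large Thumbnail"),
      width = c(2048L, 150L),
      stringsAsFactors = FALSE
    ),
    NULL
  )
)

# Step 1: one output row per row of each nested data frame;
# keep_empty = TRUE keeps story B as a row of missing values
long <- unnest_longer(stories, multimedia, keep_empty = TRUE)

# After step 1, "multimedia" is still a single data frame column
is.data.frame(long$multimedia)

# Step 2: spread that data frame column into ordinary prefixed columns
flat <- unpack(long, cols = multimedia, names_sep = "_")
names(flat)
```

Story A appears twice in the result (once per image), which is the row repetition discussed above.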

3.5. Missing values

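Finally, I replaced empty strings with NA. This was a little tricky until I realized that comparing against the empty string ("") only makes sense for character vectors, so na_if() has to be restricted to the character columns. [4]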

top_stories <- top_stories %>%
  mutate(
    across(
      where(is.character), ~ na_if(., "")
    )
  )
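
A small made-up dataframe shows what this step does and why the where(is.character) filter matters: only the character column is touched, and the integer column passes through unchanged.

```r
library(dplyr)

# Toy data with an empty string in a character column
toy <- data.frame(
  kicker = c("", "news analysis"),
  width = c(150L, 600L),
  stringsAsFactors = FALSE
)

# Apply na_if() to character columns only
cleaned <- mutate(toy, across(where(is.character), ~ na_if(.x, "")))
cleaned
```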

The dataframe now has one row per story and multimedia item (81 rows for the 27 stories) and is ready for analysis (not part of this assignment).

glimpse(top_stories)
## Rows: 81
## Columns: 26
## $ section                   <chr> "world", "world", "world", "world", "world",…
## $ subsection                <chr> "europe", "europe", "europe", "europe", "eur…
## $ title                     <chr> "In First Remarks on Attack, Putin Tries to …
## $ abstract                  <chr> "American officials, who have assessed that …
## $ web_url                   <chr> "https://www.nytimes.com/2024/03/23/world/eu…
## $ resource_identifier       <chr> "nyt://article/bd065841-6d43-573d-9890-07e05…
## $ byline                    <chr> "By Anton Troianovski", "By Anton Troianovsk…
## $ item_type                 <chr> "Article", "Article", "Article", "Article", …
## $ updated_datetime          <dttm> 2024-03-23 16:25:20, 2024-03-23 16:25:20, 2…
## $ created_datetime          <dttm> 2024-03-23 13:58:43, 2024-03-23 13:58:43, 2…
## $ published_datetime        <dttm> 2024-03-23 13:58:43, 2024-03-23 13:58:43, 2…
## $ material_type_facet       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ kicker                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ terms_descriptive_subject <chr> "Terrorism", "Terrorism", "Terrorism", "Mosc…
## $ terms_organization        <chr> NA, NA, NA, "Islamic State Khorasan", "Islam…
## $ terms_person              <chr> "Putin, Vladimir V", "Putin, Vladimir V", "P…
## $ terms_geographic_area     <chr> "Moscow (Russia), Russia", "Moscow (Russia),…
## $ multimedia_url            <chr> "https://static01.nyt.com/images/2024/03/23/…
## $ multimedia_format         <chr> "Super Jumbo", "threeByTwoSmallAt2X", "Large…
## $ multimedia_height         <int> 1463, 400, 150, 720, 400, 150, 1537, 400, 15…
## $ multimedia_width          <int> 2048, 600, 150, 1080, 600, 150, 2048, 600, 1…
## $ multimedia_type           <chr> "image", "image", "image", "image", "image",…
## $ multimedia_subtype        <chr> "photo", "photo", "photo", "photo", "photo",…
## $ multimedia_caption        <chr> "President Vladimir V. Putin of Russia at th…
## $ multimedia_copyright      <chr> "Nanna Heitmann for The New York Times", "Na…
## $ short_url                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …


4. Conclusion

I successfully retrieved New York Times top stories using its API and transformed the JSON data into an R dataframe, which could be used for subsequent data analysis.


  1. https://en.wikipedia.org/wiki/UTC%E2%88%9204:00

  2. https://open.nytimes.com/article-search-api-enhancements-3ec5bbc25f0c

  3. https://stackoverflow.com/questions/62328384/unnest-longer-gives-dollar-sign-instead-of-normal-tibble

  4. https://stackoverflow.com/questions/51449243/how-to-replace-empty-string-with-na-in-r-dataframe