The New York Times website provides APIs, as described at https://developer.nytimes.com/apis. The objective of this assignment is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame. For this assignment I chose to work with the “Archive API”, which returns an array of NYT articles for a given month, going back to 1851.
First, load required packages.
library(tidyverse)
library(jsonlite)
library(lubridate)
To use the NYTimes API you need an authorization key, obtained by creating an account, selecting an App, enabling the app, and generating the key. This process is explained at https://developer.nytimes.com/get-started.
# NYTAuth <- ""
NYTAuth <- rstudioapi::askForPassword("Authorization Key")
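As an alternative to the interactive prompt, the key can be kept out of the script entirely by storing it in an environment variable (for example in ~/.Renviron) and reading it at runtime; the variable name NYT_API_KEY below is my own choice, not an NYT convention.
# NYTAuth <- Sys.getenv("NYT_API_KEY")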
Now that the key is stored, the NYT site can be queried through the API. As a first test, query a single month.
year <- "2020"
month <- "3"
baseurl <- paste0("https://api.nytimes.com/svc/archive/v1/", year, "/", month, ".json?api-key=", NYTAuth)
query1 <- fromJSON(baseurl)
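Assuming the response uses the standard NYT envelope (status, copyright, response), a quick check that the request succeeded before digging into the payload:
stopifnot(identical(query1$status, "OK"))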
Now, take a quick look at a few aspects of query1.
dim(query1$response$docs)
## [1] 4883 20
colnames(query1$response$docs)
## [1] "abstract" "web_url" "snippet" "lead_paragraph"
## [5] "print_section" "print_page" "source" "multimedia"
## [9] "headline" "keywords" "pub_date" "document_type"
## [13] "news_desk" "section_name" "subsection_name" "byline"
## [17] "type_of_material" "_id" "word_count" "uri"
class(query1$response$docs)
## [1] "data.frame"
query1$response$meta
## $hits
## [1] 4883
The query1 object contains a list of two elements: meta, which holds the metadata, and docs, which is the archive of NYT articles. The docs data frame contains 20 variables describing each archived article. Since I'd like to see how often COVID was mentioned in the months following the outbreak, the information relevant to me is a subset of the docs columns together with the year and month of each edition.
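Note that some docs columns, such as headline and byline, are themselves nested data frames; the headline.main and headline.print_headline names used later come from splicing those nested columns in. As a sketch, jsonlite can also flatten the nesting at parse time:
query_flat <- fromJSON(baseurl, flatten = TRUE)
grep("^headline", colnames(query_flat$response$docs), value = TRUE)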
I will need to call the API for every month in a range, so I first write a script that takes a start month and year plus a number of months, and builds a data frame listing all of the year/month combinations.
df_test <- data.frame(year = integer(),
                      month = integer())
months_total <- 24
month_start <- 2
year_start <- 2020
for (i in 0:months_total){
  if (any((month_start + i)/12 == c(1:months_total))){
    # special case: month_start + i is a multiple of 12, so the modulo
    # below would return 0; map it to December of the correct year
    year <- year_start + (month_start + i - 1) %/% 12
    month <- 12
  } else {
    year <- year_start + (month_start + i) %/% 12
    month <- (month_start + i) %% 12
  }
  df_temp <- data.frame(year = year, month = month)
  df_test <- rbind(df_test, df_temp)
}
# df_test
Inspecting df_test confirms that it lists the expected year/month combinations.
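For reference, lubridate can build the same table without the modular arithmetic; a sketch using the same test start point (February 2020) and month count:
# one entry per month, matching the loop's 0:months_total range
dates <- seq(ymd("2020-02-01"), by = "month", length.out = months_total + 1)
df_test2 <- data.frame(year = year(dates), month = month(dates))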
This month arithmetic will now drive the API calls for all of the relevant months.
year_start <- 2019
month_start <- 12
months_total <- 14
# last year/month that the loop below will fetch
year_end <- year_start + (month_start + months_total - 1) %/% 12
month_end <- (month_start + months_total - 1) %% 12 + 1
# df is grown inside the loop below. The headline field arrives as a nested
# data frame whose columns are spliced in (headline.main,
# headline.print_headline, ...), so start from an empty frame; a fixed
# skeleton with a single headline column would make rbind fail on
# mismatched columns.
df <- data.frame()
for (i in 0:months_total){
  if (any((month_start + i)/12 == c(1:months_total))){
    # special case: map a multiple-of-12 month to December of the correct year
    year <- year_start + (month_start + i - 1) %/% 12
    month <- 12
  } else {
    year <- year_start + (month_start + i) %/% 12
    month <- (month_start + i) %% 12
  }
  baseurl <- paste0("https://api.nytimes.com/svc/archive/v1/", year, "/", month, ".json?api-key=", NYTAuth)
  query1 <- fromJSON(baseurl)
  df_query <- query1$response$docs
  df_temp <- data.frame(id = i + 1,
                        year = year,
                        month = month,
                        abstract = df_query$abstract,
                        web_url = df_query$web_url,
                        snippet = df_query$snippet,
                        lead_paragraph = df_query$lead_paragraph,
                        print_section = df_query$print_section,
                        print_page = df_query$print_page,
                        source = df_query$source,
                        pub_date = df_query$pub_date,
                        headline = df_query$headline)
  df <- rbind(df, df_temp)
  # the Archive API is rate limited, so pause between requests
  Sys.sleep(12)
}
Now the relevant data is in a data frame that can be used for processing. For a brief analysis I will check for COVID occurrences in the various text fields. A similar approach was used in Project 3 for detecting skills matches.
covid <- c("covid", "covid19", "covid-19", "pandemic", "sars-cov-2", "coronavirus", "epidemic", "quarantine")
covid_regex <- paste0('\\b', paste(covid, collapse = '\\b|\\b'), '\\b')
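A quick sanity check of the pattern on two made-up headlines (the example strings are my own):
str_detect(c("The coronavirus outbreak widened today", "A quiet night at the Oscars"),
           regex(covid_regex, ignore_case = TRUE))
## [1]  TRUE FALSE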
df_covid <- data.frame(id = integer(),
                       covid = integer())
for (i in seq_len(nrow(df))){
  # combine the text fields, replacing NAs so that one missing field
  # does not turn the whole combined string into NA
  all_relevant <- str_c(str_replace_na(df$headline.main[i], ""),
                        str_replace_na(df$headline.print_headline[i], ""),
                        str_replace_na(df$abstract[i], ""),
                        str_replace_na(df$lead_paragraph[i], ""),
                        str_replace_na(df$snippet[i], ""),
                        sep = " ")
  temp <- str_extract_all(all_relevant, regex(covid_regex, ignore_case = TRUE))
  if (is_empty(temp[[1]])){
    covid_flag <- 0
  } else {
    covid_flag <- 1
  }
  df_temp <- data.frame(id = i,
                        covid = covid_flag)
  df_covid <- rbind(df_covid, df_temp)
}
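For reference, the same flag can be computed without the explicit loop; a sketch using vectorized stringr calls with the same NA handling:
combined <- str_c(str_replace_na(df$headline.main, ""),
                  str_replace_na(df$headline.print_headline, ""),
                  str_replace_na(df$abstract, ""),
                  str_replace_na(df$lead_paragraph, ""),
                  str_replace_na(df$snippet, ""),
                  sep = " ")
df_covid2 <- data.frame(id = seq_len(nrow(df)),
                        covid = as.integer(str_detect(combined,
                                                      regex(covid_regex, ignore_case = TRUE))))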
Now I can make a plot of those COVID occurrences.
df_all <- cbind(df, covid = df_covid$covid)
df_covid_pos <- df_all |> filter(covid == 1) |>
  mutate(date_string = str_extract(pub_date, "[0-9]+\\-[0-9]+\\-[0-9]+"),
         date_date = ymd(date_string)) |>
  group_by(date_date) |>
  summarise(count = n()) |>
  filter(count < 250) # drop days with 250 or more matches
df_covid_pos |> ggplot(aes(x = date_date, y = count)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = 'y ~ x')
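The daily counts are fairly noisy; the same pipeline could be aggregated by month instead of by day for a coarser view. A sketch using lubridate's floor_date (a variant, not the plot shown above):
df_covid_monthly <- df_all |> filter(covid == 1) |>
  mutate(month_date = floor_date(ymd(str_extract(pub_date, "[0-9]+\\-[0-9]+\\-[0-9]+")), "month")) |>
  count(month_date)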
The objective of this assignment, to get relevant data from the NYT and construct a data frame from it, was met. Using the data I was able to analyze COVID-19 coverage in the news throughout the pandemic. A similar approach could be used to investigate the news following other big stories; for example, another analysis idea I had was to look at coverage of Black Lives Matter before and after the death of George Floyd.