This document outlines the process of retrieving data from an API so that it can be analyzed in R. In this case, we will be using the New York Times APIs as our example. More specifically, we will be pulling data from NYT’s Times Newswire API, which “provides an up-to-the-minute stream of published articles.” The steps outlined below will utilize the documentation provided for this specific API in order to connect and pull data.
Before using the NYT API, you will need to create an app with an authorization key that can be used to access their data. The steps for doing this are outlined here.
When running this .Rmd file, you should be prompted to
enter this key so that it can be used to authenticate the API calls
(make sure to run using the Knit With Parameters...
option). You can also enter the key into the file manually by
un-commenting the line indicated in the code block below. Either way,
the key will be stored as the variable api_key. The
documentation for the specific API that is used provides the required
values for the other two variables defined below, api_url and
num_results. The api_url is simply the URL
from which we will be pulling the data, which can be found once you
have determined the NYT API you wish to use (listed here). The
num_results variable tells the request the number of rows that
we want returned (any multiple of 20 up to 500).
# Enter NYT API key
api_key <- params$key
# Un-comment below line for manual entry.
#api_key <- 'dummy-api-key'
# Enter NYT API web location
api_url <- 'https://api.nytimes.com/svc/news/v3/content/all/all.json?'
# Number of results
num_results <- 500
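As an alternative to typing the key into the file, it can be read from an environment variable, and the row limit can be checked before the request is built. The sketch below is only illustrative: the variable name NYT_API_KEY and the helper check_num_results are hypothetical and do not come from the NYT documentation. The rule it enforces (a multiple of 20, up to 500) is the one stated above.

```r
# Hypothetical helper: validate the row limit before building the request.
# The rule (multiple of 20, at most 500) is the one described in the text.
check_num_results <- function(n) {
  ok <- is.numeric(n) && length(n) == 1 && n >= 20 && n <= 500 && n %% 20 == 0
  if (!isTRUE(ok)) {
    stop("num_results must be a multiple of 20 between 20 and 500")
  }
  as.integer(n)
}

# Assumption: the key was saved beforehand in an environment variable
# named NYT_API_KEY (for example in ~/.Renviron). The name is illustrative.
api_key_env <- Sys.getenv("NYT_API_KEY", unset = NA)
if (is.na(api_key_env)) {
  message("NYT_API_KEY is not set; fall back to params$key or manual entry.")
}

num_results <- check_num_results(500)
```

Keeping the key out of the file itself means the .Rmd can be shared without exposing it.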
The code chunk below combines the three variables above into a single
API call following the instructions provided in the API documentation. It then
uses the fromJSON function to carry out this API
call, and the results are transformed into an R data frame,
df.
library(jsonlite)  # fromJSON
library(dplyr)     # %>%, glimpse
api_call <- paste0(api_url, "limit=", num_results, "&api-key=", api_key)
df <- fromJSON(api_call, flatten = TRUE) %>% data.frame()
glimpse(df)
## Rows: 500
## Columns: 27
## $ status <chr> "OK", "OK", "OK", "OK", "OK", "OK", "OK",…
## $ copyright <chr> "Copyright (c) 2022 The New York Times Co…
## $ num_results <int> 500, 500, 500, 500, 500, 500, 500, 500, 5…
## $ results.slug_name <chr> "25pol-padebate-takeaways", "midterm-blog…
## $ results.section <chr> "U.S.", "U.S.", "U.S.", "Today’s Paper", …
## $ results.subsection <chr> "Politics", "Politics", "Politics", "", "…
## $ results.title <chr> "5 Takeaways From the Pennsylvania Senate…
## $ results.abstract <chr> "Lt. Gov. John Fetterman and Mehmet Oz tr…
## $ results.uri <chr> "nyt://article/69d757ba-dd74-5900-b0d2-f1…
## $ results.url <chr> "https://www.nytimes.com/2022/10/25/us/po…
## $ results.byline <chr> "BY KATIE GLUECK AND TRIP GABRIEL", "BY C…
## $ results.item_type <chr> "Article", "Article", "Article", "Article…
## $ results.source <chr> "New York Times", "New York Times", "New …
## $ results.updated_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.created_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.published_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.first_published_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.material_type_facet <chr> "News", "Live Blog Post", "News", "Quote"…
## $ results.kicker <chr> "", "", "", "", "", "", "", "wordplay, th…
## $ results.subheadline <chr> "", "", "", "", "", "", "", "", "", "", "…
## $ results.des_facet <list> <"Midterm Elections (2022)", "United Sta…
## $ results.org_facet <list> <"Democratic Party", "Republican Party",…
## $ results.per_facet <list> <"Fetterman, John (1969- )", "Oz, Mehmet…
## $ results.geo_facet <list> "Pennsylvania", <NULL>, "Pennsylvania", …
## $ results.related_urls <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, …
## $ results.multimedia <list> <NULL>, [<data.frame[4 x 8]>], [<data.fr…
## $ results.thumbnail_standard <chr> NA, "https://static01.nyt.com/images/2022…
The output above shows that our API call has produced the desired results.
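Before converting a response, it can be useful to confirm that its status field is "OK" and to pull out the results element directly. The sketch below uses a made-up miniature payload with the same top-level fields as the real response, so it runs without a key or network access; the helper name results_from_response is hypothetical.

```r
library(jsonlite)

# Hypothetical helper: return only the results element of a parsed response,
# stopping early if the status field is not "OK".
results_from_response <- function(resp) {
  if (!identical(resp$status, "OK")) {
    stop("API call did not return status OK")
  }
  resp$results
}

# Made-up miniature payload mimicking the structure of the real response.
mock_json <- '{
  "status": "OK",
  "copyright": "example",
  "num_results": 2,
  "results": [
    {"section": "U.S.", "title": "First example"},
    {"section": "World", "title": "Second example"}
  ]
}'

mock_resp <- fromJSON(mock_json, flatten = TRUE)
mock_df <- results_from_response(mock_resp)
nrow(mock_df)
```

Because the results element is extracted on its own here, the resulting columns are named section and title, with no results. prefix and no repeated status, copyright, or num_results columns.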
Now that the data has been loaded, we can move forward with a number
of data cleaning steps. First, the actual fields we need in this case
are only those with names preceded by the results. prefix,
as this was the actual JSON element that contained the list of results.
The status, copyright, and
num_results fields were also their own JSON elements, but
contain repeated information about the API call that can now be removed.
This is done in the cell below:
df <- df %>%
  select(-status, -copyright, -num_results)
glimpse(df)
## Rows: 500
## Columns: 24
## $ results.slug_name <chr> "25pol-padebate-takeaways", "midterm-blog…
## $ results.section <chr> "U.S.", "U.S.", "U.S.", "Today’s Paper", …
## $ results.subsection <chr> "Politics", "Politics", "Politics", "", "…
## $ results.title <chr> "5 Takeaways From the Pennsylvania Senate…
## $ results.abstract <chr> "Lt. Gov. John Fetterman and Mehmet Oz tr…
## $ results.uri <chr> "nyt://article/69d757ba-dd74-5900-b0d2-f1…
## $ results.url <chr> "https://www.nytimes.com/2022/10/25/us/po…
## $ results.byline <chr> "BY KATIE GLUECK AND TRIP GABRIEL", "BY C…
## $ results.item_type <chr> "Article", "Article", "Article", "Article…
## $ results.source <chr> "New York Times", "New York Times", "New …
## $ results.updated_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.created_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.published_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.first_published_date <chr> "2022-10-25T23:41:42-04:00", "2022-10-25T…
## $ results.material_type_facet <chr> "News", "Live Blog Post", "News", "Quote"…
## $ results.kicker <chr> "", "", "", "", "", "", "", "wordplay, th…
## $ results.subheadline <chr> "", "", "", "", "", "", "", "", "", "", "…
## $ results.des_facet <list> <"Midterm Elections (2022)", "United Sta…
## $ results.org_facet <list> <"Democratic Party", "Republican Party",…
## $ results.per_facet <list> <"Fetterman, John (1969- )", "Oz, Mehmet…
## $ results.geo_facet <list> "Pennsylvania", <NULL>, "Pennsylvania", …
## $ results.related_urls <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, …
## $ results.multimedia <list> <NULL>, [<data.frame[4 x 8]>], [<data.fr…
## $ results.thumbnail_standard <chr> NA, "https://static01.nyt.com/images/2022…
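An alternative to dropping the metadata columns by name is to keep only the columns that carry the results. prefix. The sketch below shows this on a toy data frame with made-up values; starts_with() is the dplyr selection helper and matches the prefix literally.

```r
library(dplyr)

# Toy data frame with the same naming pattern as the real one (values made up).
toy <- data.frame(
  status = c("OK", "OK"),
  num_results = c(2L, 2L),
  results.title = c("A", "B"),
  results.section = c("U.S.", "World")
)

# Keep only the columns whose names begin with "results."
toy_results <- toy %>% select(starts_with("results."))
colnames(toy_results)
```

This form keeps working if the response gains another top-level metadata field, whereas the name-based version would need that field added to the list.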
Next, the cell below cleans the column names to remove the
results. prefix from each column name:
# Anchor the pattern and escape the dot so only a leading, literal "results." is removed
colnames(df) <- str_remove(colnames(df), '^results\\.')
colnames(df)
## [1] "slug_name" "section" "subsection"
## [4] "title" "abstract" "uri"
## [7] "url" "byline" "item_type"
## [10] "source" "updated_date" "created_date"
## [13] "published_date" "first_published_date" "material_type_facet"
## [16] "kicker" "subheadline" "des_facet"
## [19] "org_facet" "per_facet" "geo_facet"
## [22] "related_urls" "multimedia" "thumbnail_standard"
As is clear from the output above, all column names have been successfully cleaned.
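When stripping a prefix with a regular expression, the exact pattern matters: an unescaped dot matches any character, and an unanchored pattern can match in the middle of a name. The base R sketch below illustrates the difference on made-up column names, the third of which is hypothetical and chosen to trigger the problem.

```r
# Made-up column names, including one where an unanchored pattern misfires.
cols <- c("results.title", "results.url", "search_results_count")

# Unescaped, unanchored pattern: "." matches any character, anywhere in the name.
loose <- gsub("results.", "", cols)

# Anchored and escaped pattern: only a leading, literal "results." is removed.
strict <- sub("^results\\.", "", cols)

loose
strict
```

Both calls clean the first two names, but the loose pattern also turns the third name into search_count, while the strict pattern leaves it untouched.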
Lastly, looking at the data, we can see that the entries in the
byline column are preceded by the string
"BY ". To fix this, we replace this column with an
author column that drops this extra string (rows whose byline
does not contain "BY " become NA):
df <- df %>%
  mutate(author = str_extract(byline, '(?<=BY ).*'))
df <- df %>%
  select(-byline)
head(df$author)
## [1] "KATIE GLUECK AND TRIP GABRIEL" "CARL HULSE"
## [3] "SHANE GOLDMACHER" NA
## [5] "JESSE GREEN" "NEWSNATION"
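The behavior of str_extract on bylines without the prefix, and one possible way of separating multiple authors, can be sketched on made-up bylines. The " AND " separator is an assumption based on the sample output above; bylines with three or more authors may be formatted differently.

```r
library(stringr)

# Made-up bylines: two with the prefix, one empty.
bylines <- c("BY JANE DOE AND JOHN ROE", "BY ALEX POE", "")

# str_extract() returns NA when the pattern does not match.
authors <- str_extract(bylines, "(?<=BY ).*")

# Assumption: co-authors are separated by the literal string " AND ".
author_list <- str_split(authors, fixed(" AND "))

authors
author_list
```

The empty byline becomes NA rather than an empty string, which is why NA values appear in the author column.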
The result of running the code in the above sections is that we now
have a clean set of data containing information about the 500 most
recently published New York Times articles. We can now use this
data to answer some interesting questions. (Note that this
.Rmd file was first run on October 25th, 2022, and because
this data is constantly being updated, the following commentary may not
match the results seen if the code is rerun. However, the code itself
should still be able to answer the questions that follow.)
The following code breaks up the most recently published articles by section, and presents the information as a bar chart:
plt_df <- df %>%
  group_by(section) %>%
  summarise(
    count_articles = n(),
    pct_articles = (n() / nrow(df)) * 100
  ) %>%
  arrange(desc(count_articles))
ggplot(data = plt_df, aes(x = reorder(section, pct_articles), y = pct_articles)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  labs(
    x = 'Section',
    y = '% of Articles In Section',
    title = 'Percent Published By Section (Last 500 Articles)'
  ) +
  theme(axis.text.y = element_text(size = 8))
As is clear in the plot above, the section with the most published
articles is World.
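The per-section counts and percentages can also be computed with dplyr's count(), which may be easier to check on a small example. The sketch below uses made-up section values and is an equivalent formulation for illustration, not the method used for the plot.

```r
library(dplyr)

# Made-up sections for five articles.
sections <- data.frame(section = c("World", "U.S.", "World", "Arts", "World"))

# Count articles per section, largest first, then convert counts to percentages.
section_share <- sections %>%
  count(section, name = "count_articles", sort = TRUE) %>%
  mutate(pct_articles = count_articles / sum(count_articles) * 100)

section_share
```

In this toy data World accounts for 3 of 5 articles, so its share is 60 percent, and the shares sum to 100.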
To see when articles are published, the cell below determines the hour each
article was first published and stores that information in the
first_published_hour column. The data is then grouped by
that column to make a plot of the number of articles published over
time:
df <- df %>%
  mutate(first_published_hour =
    anytime(
      x = paste(str_extract(first_published_date, '^[^:]*'), ':00:00', sep = ''),
      tz = Sys.timezone()
    )
  )
plt_data <- df %>%
  group_by(first_published_hour) %>%
  summarise(count_articles = n()) %>%
  arrange(desc(first_published_hour))
x <- df$first_published_hour
ggplot(data = plt_data, aes(x = first_published_hour, y = count_articles)) +
  geom_line() +
  scale_x_datetime(
    breaks = seq.POSIXt(from = min(x), to = max(x), by = "1 days"),
    minor_breaks = seq.POSIXt(from = min(x), to = max(x), by = "4 hours")
  ) +
  theme(axis.text.x = element_text(angle = 8)) +
  labs(
    x = 'Hour Published',
    y = 'Number of Articles Published',
    title = 'NYT Articles Published By Hour (Last 500 Articles)'
  )
As might be expected, the most articles seem to be published in the 5:00 PM to 6:00 PM hour, possibly right before people leave for the day.
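The hour-truncation step can be illustrated in base R on a few made-up timestamps in the same layout as first_published_date. This sketch assumes the UTC offset can be dropped and the clock time read as New York time; it is a simplified alternative to the approach used for the plot, not a replacement for it.

```r
# Made-up timestamps in the same ISO 8601 layout as first_published_date.
stamps <- c(
  "2022-10-25T17:05:12-04:00",
  "2022-10-25T17:48:30-04:00",
  "2022-10-25T23:41:42-04:00"
)

# Keep "YYYY-MM-DDTHH" (the first 13 characters) and parse it as a date-time.
# Assumption: the offset is dropped and the clock time is read as New York time.
hours <- as.POSIXct(substr(stamps, 1, 13), format = "%Y-%m-%dT%H",
                    tz = "America/New_York")

# Count articles per hour bucket.
hour_counts <- table(format(hours, "%Y-%m-%d %H:00"))
hour_counts
```

Because the offset is discarded before parsing, the hour buckets reflect the clock time written in the timestamp itself, whatever time zone label is attached afterwards.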
This document used publicly available New York Times data to demonstrate how one could use an API to load data and answer some interesting research questions. Though the details of interacting with other APIs will differ from those of the NYT APIs, the general process described and carried out above should remain broadly similar.