DATA 607 - Assignment 9
The New York Times web site provides a rich set of APIs, as described here: https://developer.nytimes.com/apis. You’ll need to start by signing up for an API key.
Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R DataFrame.
Connecting and Requesting from the API
I am using the science and health sections of the Top Stories API.
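The two requests in this section share one URL pattern and differ only in the section name. As a minimal sketch (the helper name `top_stories_url` is hypothetical and not part of the original analysis), the URL construction could be wrapped in a function that also fails early when the `TIMES_API_KEY` environment variable is unset:

```r
# Hypothetical helper: build a Top Stories request URL for one section.
# Assumes the API key is stored in the TIMES_API_KEY environment variable,
# as in the rest of this document.
top_stories_url <- function(section) {
  key <- Sys.getenv("TIMES_API_KEY")
  if (!nzchar(key)) {
    stop("TIMES_API_KEY is not set; add it to .Renviron or call Sys.setenv().")
  }
  paste0("https://api.nytimes.com/svc/topstories/v2/", section, ".json?api-key=", key)
}
```

Calling `top_stories_url("science")` and `top_stories_url("health")` would then produce the same two URLs that are built by hand here.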
# Assumed packages: fromJSON() from jsonlite; the pipe, add_column(), select(),
# unnest() and ggplot() from the tidyverse.
library(tidyverse)
library(jsonlite)

url <- paste0('https://api.nytimes.com/svc/topstories/v2/science.json?api-key=', Sys.getenv("TIMES_API_KEY"))
url2 <- paste0('https://api.nytimes.com/svc/topstories/v2/health.json?api-key=', Sys.getenv("TIMES_API_KEY"))

# Request each section and label its rows with the section it came from.
science_data <- fromJSON(url)$results %>%
  as.data.frame() %>%
  add_column("web_section" = "Science", .before = "section")
health_data <- fromJSON(url2)$results %>%
  as.data.frame() %>%
  add_column("web_section" = "Health", .before = "section")
Science
rmarkdown::paged_table(science_data)
Health
rmarkdown::paged_table(health_data)
Tidying the Data
Nested Format
In this format, I have removed columns that I do not want in the
final data frame and I have left the des_facet and
org_facet columns nested.
science_and_health_nested <- rbind(science_data, health_data) %>%
select(web_section, title, abstract, url, "author" = byline, published_date, updated_date, des_facet, org_facet)
rmarkdown::paged_table(science_and_health_nested)
In this way, one has all the necessary information and can un-nest
the des_facet or org_facet columns as they see
fit for analysis.
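To make the idea of a nested column concrete, here is a small sketch with made-up titles and tags (not real API output). Each cell of the list column holds a whole character vector, so the table keeps one row per article:

```r
# Toy data with made-up titles and tags, standing in for the API results.
toy <- data.frame(title = c("Article A", "Article B"), stringsAsFactors = FALSE)
toy$des_facet <- list(c("Vaccines", "Measles"), "Fossils")

# One row per article; each des_facet cell holds a whole character vector.
tags_per_article <- lengths(toy$des_facet)   # number of tags in each row
first_article_tags <- toy$des_facet[[1]]     # the tags of the first article
```

Keeping the column nested means quantities such as the number of tags per article can be read off with `lengths()` without reshaping the table.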
Long Format
For the long format, the des_facet and
org_facet columns are un-nested. The result is a set of
data tables that have multiple rows for each article based on the number
of tags in the des_facet or org_facet
columns.
# unnest des_facet
science_and_health_des_long <- rbind(science_data, health_data) %>%
select(web_section, title, abstract, url, "author" = byline, published_date, updated_date, des_facet, org_facet) %>%
unnest(des_facet)
#unnest org_facet
science_and_health_org_long <- rbind(science_data, health_data) %>%
select(web_section, title, abstract, url, "author" = byline, published_date, updated_date, des_facet, org_facet) %>%
unnest(org_facet)
rmarkdown::paged_table(science_and_health_des_long)
rmarkdown::paged_table(science_and_health_org_long)
In this format, the tags in the des_facet and
org_facet columns can be analyzed. Alternatively, one could
keep the nested data frame and un-nest the column that they wish to
analyze.
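One behavior worth keeping in mind when un-nesting: by default `tidyr::unnest()` drops rows whose list element is empty, so an article with no tags would disappear from the long table. Whether any article in this data has an empty tag list is an assumption; the output here does not confirm it. A minimal sketch on made-up data, using the `keep_empty` argument:

```r
library(tibble)
library(tidyr)

# Made-up articles; "Article B" has no organization tags.
toy <- tibble(
  title = c("Article A", "Article B", "Article C"),
  org_facet = list(c("Org 1", "Org 2"), character(0), "Org 1")
)

# Default: the row with an empty tag vector is dropped.
long_default <- unnest(toy, org_facet)

# keep_empty = TRUE: the row is kept, with NA in place of a tag.
long_keep <- unnest(toy, org_facet, keep_empty = TRUE)
```

If every article must remain visible in the long table, `keep_empty = TRUE` is the safer choice; for counting tags, the default is usually what is wanted.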
Short Format
For the short format, the des_facet and
org_facet columns are unlisted and turned into a string in
the data table.
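As a small worked illustration of the collapsing step (toy tags, not real API output), `paste()` with `collapse` joins each article's tag vector into one string:

```r
# Made-up tags; the third article has none.
tags <- list(c("Vaccines", "Measles"), "Fossils", character(0))

# Join each tag vector into a single "; "-separated string.
collapsed <- vapply(tags, function(x) paste(x, collapse = "; "), character(1))
```

An article with no tags ends up with an empty string rather than `NA`, which may matter if the column is filtered later.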
science_and_health <- rbind(science_data, health_data) %>%
select(web_section, title, abstract, url, "author" = byline, published_date, updated_date, des_facet, org_facet)
############ unlisting des_facets #############
# Collapse each article's tag vector into a single "; "-separated string.
science_and_health <- science_and_health %>%
  mutate(des_facet = vapply(des_facet, function(tags) paste(unlist(tags), collapse = "; "), character(1)))
############ unlisting org_facets #############
science_and_health <- science_and_health %>%
  mutate(org_facet = vapply(org_facet, function(tags) paste(unlist(tags), collapse = "; "), character(1)))
rmarkdown::paged_table(science_and_health)
Some Analysis
Which organizations are mentioned most?
For this, I will use the already un-nested
science_and_health_org_long table.
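One caveat, stated as an assumption: if the same article can appear in both the science and health feeds (the source does not say whether it does), counting rows of the long table would count that article twice. A hedged sketch on made-up data, de-duplicating on `url` before counting:

```r
library(dplyr)

# Made-up long-format rows; the article "u1" appears under both sections.
toy_long <- data.frame(
  web_section = c("Science", "Health", "Health", "Science"),
  url = c("u1", "u1", "u2", "u3"),
  org_facet = c("Org 1", "Org 1", "Org 1", "Org 2"),
  stringsAsFactors = FALSE
)

# Counting rows: "Org 1" is counted once per row, including the duplicate.
row_counts <- toy_long %>% count(org_facet, sort = TRUE)

# Counting distinct articles: each (url, org_facet) pair is counted once.
article_counts <- toy_long %>%
  distinct(url, org_facet) %>%
  count(org_facet, sort = TRUE)
```

If no article is shared between the two feeds, both approaches give the same numbers.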
Table
most_common_org <- science_and_health_org_long %>%
count(org_facet) %>%
arrange(desc(n)) %>%
filter(n > 1)
most_common_org %>%
  knitr::kable(col.names = c("Organization", "Number of Articles"))

| Organization | Number of Articles |
|---|---|
| Centers for Disease Control and Prevention | 8 |
| Food and Drug Administration | 6 |
| Current Biology (Journal) | 2 |
| Eli Lilly and Company | 2 |
| Environmental Protection Agency | 2 |
| New England Journal of Medicine | 2 |
| Novo Nordisk A/S | 2 |
| Republican Party | 2 |
| Sanofi SA | 2 |
| Senate Committee on Homeland Security and Governmental Affairs | 2 |
Graph
most_common_org %>%
ggplot(aes(x = n, y = reorder(org_facet, n))) +
geom_bar(stat = "identity", fill = "darkred") +
geom_text(aes(label = n), position = position_stack(vjust = 0.9), fontface = 'bold', color = 'white') +
labs(title = "Most Mentioned Organizations", x = "", y = "")
The CDC and FDA are the most mentioned organizations, appearing in 8 and 6 articles respectively.
Conclusions
The goal of this assignment was to read in JSON data from the New York Times API and transform it into an R data frame. The “Tidying the Data” section provides three separate ways of organizing this data in an R data frame: nested, long, and short.
To access the NYT API, follow the sign-up steps on the NYT developer site (https://developer.nytimes.com/apis).