library(dotenv)
library(jsonlite)
library(tidyverse)
library(lubridate)
A demonstration utilizing the NY Times Article Search API and coercing the search results into a data frame. The key words will be climate change and sustainability, during the year 2021 in NYC.
# load up hidden api key
article_api <- Sys.getenv("ARTICLE_API")
#semantic_api <- Sys.getenv("SEMANTIC_API")
# set base url
base_url_art <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q="
#base_url_sem <- "http://api.nytimes.com/svc/semantic/v2/concept/name"
# set parameters
term <- "sustainability+climate+change"
fq <- "&fq=glocation='New York City'"
begin_date <- "20210101"
end_date <- "20211231"
complete_url <- paste0(base_url_art, term, fq, "&begin_date=", begin_date, "&end_date=", end_date, "&facet_filter=true&api-key=", article_api)
# import dataset to R
sus <- fromJSON(complete_url)
# view how many hits
sus$response$meta$hits
## [1] 645
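As an aside, the same request could be built with the httr package, which assembles and URL-encodes the query string instead of pasting it together by hand. This is just a sketch, assuming httr is installed and reusing the same key and filter as above (spaces in q stand in for the hand-encoded "+" signs):

library(httr)

# sketch: let httr build and encode the query string
resp <- GET(
  "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(
    q = "sustainability climate change",
    fq = "glocation='New York City'",
    begin_date = "20210101",
    end_date = "20211231",
    facet_filter = "true",
    `api-key` = article_api
  )
)
sus_httr <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))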
This only returns the first 10 results, but I would like all 645 results in a data frame. This is where the jsonlite documentation came in handy: it explains paging with jsonlite and how to combine pages of data. The following code loop was adapted directly from the documentation.
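For reference, the piece that stitches the pages together at the end is jsonlite::rbind_pages(), which row-binds a list of data frames and fills any columns missing from a given page with NA. A tiny illustration:

df1 <- data.frame(a = 1:2, b = c("x", "y"))
df2 <- data.frame(a = 3, c = TRUE)
rbind_pages(list(df1, df2))   # 3 rows; b and c are NA where a page lacked them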
In order to combine all the requests during 2021, I need to see how many hits there were on the search term:
hits <- sus$response$meta$hits
cat("There were ",hits," hits for the search term Sustainability during 2021",sep = "")
## There were 645 hits for the search term Sustainability during 2021
At 10 search results per page, divide the hit count by 10 (rounding up) to get the number of pages. Pages are indexed from 0, so subtract 1 to get the last page index.
max_pages <- ceiling(hits / 10) - 1
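To make the arithmetic concrete with the 645 hits above:

ceiling(645 / 10) - 1   # 64, so the loop below makes 65 requests for pages 0 through 64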
Ok, now we can implement the loop specified in the jsonlite package documentation.
# store all pages in list
pages <- list()
for (i in 0:max_pages) {
  sus_df <- fromJSON(paste0(complete_url, "&page=", i)) %>%
    data.frame()
  message("Retrieving page ", i)
  pages[[i + 1]] <- sus_df
  Sys.sleep(6)
}
# combine into one
sus_df <- rbind_pages(pages)
# check the number of rows
nrow(sus_df)
## [1] 645
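One caveat: if any single request errors out (a rate limit, a network hiccup), the loop above stops and the pages collected so far are all you get. Here is a sketch of a more defensive version that skips failed pages instead of aborting; retry logic is left out for brevity:

# sketch: skip pages that fail rather than stopping the whole loop
pages <- list()
for (i in 0:max_pages) {
  page_df <- tryCatch(
    fromJSON(paste0(complete_url, "&page=", i)) %>% data.frame(),
    error = function(e) {
      message("Page ", i, " failed: ", conditionMessage(e))
      NULL
    }
  )
  if (!is.null(page_df)) pages[[length(pages) + 1]] <- page_df
  Sys.sleep(6)
}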
# save df
saveRDS(sus_df, file = "sustainability_climate_change_nyc_2021.RDS")
Reload the data frame so that I don’t have to make an API call every time I want to work with the data.
sus_df <- readRDS("sustainability_climate_change_nyc_2021.RDS")
Most column headers are prefixed with response.docs or response.meta; let’s clean that up.
colnames(sus_df)
## [1] "status" "copyright"
## [3] "response.docs.abstract" "response.docs.web_url"
## [5] "response.docs.snippet" "response.docs.lead_paragraph"
## [7] "response.docs.print_section" "response.docs.print_page"
## [9] "response.docs.source" "response.docs.multimedia"
## [11] "response.docs.headline" "response.docs.keywords"
## [13] "response.docs.pub_date" "response.docs.document_type"
## [15] "response.docs.news_desk" "response.docs.section_name"
## [17] "response.docs.subsection_name" "response.docs.byline"
## [19] "response.docs.type_of_material" "response.docs._id"
## [21] "response.docs.word_count" "response.docs.uri"
## [23] "response.meta.hits" "response.meta.offset"
## [25] "response.meta.time"
colnames(sus_df) <- sus_df %>%
  colnames() %>%
  str_replace_all(c(
    "response\\.docs\\._" = "",
    "response\\.docs\\." = "",
    "response\\.meta\\." = ""))
# preview results
colnames(sus_df)
## [1] "status" "copyright" "abstract" "web_url"
## [5] "snippet" "lead_paragraph" "print_section" "print_page"
## [9] "source" "multimedia" "headline" "keywords"
## [13] "pub_date" "document_type" "news_desk" "section_name"
## [17] "subsection_name" "byline" "type_of_material" "id"
## [21] "word_count" "uri" "hits" "offset"
## [25] "time"
Let’s take a look at the distribution of articles with keywords Sustainability and Climate Change in NYC for 2021.
sus_df %>%
  group_by(month = month(pub_date, label = TRUE)) %>%
  summarise(count = n()) %>%
  ggplot(aes(month, count, group = 1, color = count)) +
  geom_line() +
  labs(y = "Article Count", x = "",
       title = "NYT Articles containing Sustainability and Climate Change in 2021",
       color = "")
Interestingly, almost twice as many articles were published in October as in January. I wonder why? Perhaps this has to do with more extreme weather events during that period.
Let’s take a look at the most popular sections in which they were published.
sus_df %>%
  group_by(section_name) %>%
  summarise(count = n()) %>%
  slice_max(count, n = 10) %>%
  mutate(section_name = replace(section_name, section_name == "Today's International New York Times", "Today's Int NYT")) %>%
  ggplot(aes(reorder(section_name, count), count, fill = count)) +
  geom_bar(stat = "identity") +
  labs(x = "", y = "Article Count") +
  ggtitle("NYT Articles by Section containing Sustainability and Climate Change in 2021") +
  coord_flip() +
  scale_fill_continuous(name = "") +
  theme(plot.title = element_text(hjust = 0.65, vjust = 1))
Interestingly, Business Day was the most popular section for articles containing these search terms, not Climate, which came in second.
Let’s take a look at word count.
sus_df %>%
  ggplot(aes(word_count)) +
  geom_histogram(binwidth = 150) +
  labs(y = "Article Count", x = "Word count",
       title = "Distribution of articles by word count")
Looks like the typical article is ~1,500 words, with some extreme outliers upwards of 10k words. There are also ~25 articles with a word count of 0; perhaps they were visual pieces classified as articles.
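To check that hunch about the zero-word-count rows, a quick cross-tab of what they actually are (a sketch using the columns cleaned up above; output not shown):

sus_df %>%
  filter(word_count == 0) %>%
  count(type_of_material, document_type, sort = TRUE)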
I wonder what the median word count looks like broken out by the section each article was published in.
(word_count_section <- sus_df %>%
   group_by(section_name) %>%
   summarise(count = n(),
             median_word_count = round(median(word_count)),
             mean_word_count = round(mean(word_count))) %>%
   arrange(desc(count)))
## # A tibble: 30 × 4
## section_name count median_word_count mean_word_count
## <chr> <int> <dbl> <dbl>
## 1 Business Day 114 1452 1456
## 2 Climate 100 1174 1175
## 3 Opinion 83 1294 1588
## 4 U.S. 61 1141 1135
## 5 World 50 1310 1401
## 6 Briefing 44 1326 1368
## 7 Style 17 926 1007
## 8 Arts 15 1227 1459
## 9 The Learning Network 15 1487 2160
## 10 Magazine 14 4348 4456
## # … with 20 more rows
word_count_section %>%
  slice_max(count, n = 10) %>%
  ggplot(aes(reorder(section_name, median_word_count), median_word_count, fill = median_word_count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_gradient2(high = scales::muted("blue"), mid = scales::muted("red")) +
  labs(x = "", y = "Median word count", title = "Median word count by Section Name (Top 10)") +
  theme(legend.position = "none") +
  geom_text(aes(label = median_word_count),
            hjust = 1.2,
            color = "#ffffff")
Looks like the NYT API could be very useful for future data collection. For 2021 at least, articles containing the key words climate change and sustainability were most prevalent in October; it would be interesting to see when they were most published over a series of years. Articles were also most prevalent in the Business section, even more than in the Climate section. However, a full analysis would need to account for the very real possibility that the Business section publishes more articles overall than the Climate section, and would therefore have more opportunity to publish articles with these key words. Articles average around 1,500 words, and Podcasts and Magazine have by far the highest median word counts. It would be interesting to build some word clouds with facet mapping to compare the most popular words by section name.
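As a starting point for that last idea, here is a sketch of per-section word frequencies using the tidytext package on the abstract field (the API returns abstracts and lead paragraphs, not full article text); the resulting counts could then feed a faceted word cloud, e.g. via ggwordcloud:

library(tidytext)   # assumed installed; not used elsewhere in this post

top_words_by_section <- sus_df %>%
  select(section_name, abstract) %>%
  unnest_tokens(word, abstract) %>%
  anti_join(stop_words, by = "word") %>%
  count(section_name, word, sort = TRUE) %>%
  group_by(section_name) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

# e.g. ggplot(top_words_by_section) +
#   ggwordcloud::geom_text_wordcloud(aes(label = word, size = n)) +
#   facet_wrap(~ section_name)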