library(dotenv)
library(jsonlite)
library(tidyverse)
library(lubridate)

Overview

This is a demonstration of the NY Times Article Search API, coercing the search results into a data frame. The search terms are climate change and sustainability, limited to the year 2021 and filtered to New York City.

Load in data

# load up hidden api key
article_api <- Sys.getenv("ARTICLE_API")
#semantic_api <- Sys.getenv("SEMANTIC_API")

# set base url
base_url_art <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q="
#base_url_sem <- "http://api.nytimes.com/svc/semantic/v2/concept/name"

# set parameters
term <- "sustainability+climate+change"
fq <- "&fq=glocation='New York City'"
begin_date <- "20210101"
end_date <- "20211231"
complete_url <- paste0(base_url_art, term, fq,
                       "&begin_date=", begin_date,
                       "&end_date=", end_date,
                       "&facet_filter=true&api-key=", article_api)
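The API key itself is never hard-coded. With the dotenv package it is read from a .env file in the project directory, which dotenv loads when the package is attached. That file would contain a single line along these lines (the value shown is just a placeholder, not a real key):

ARTICLE_API=your_article_search_api_key_here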


# import dataset to R
sus <- fromJSON(complete_url) 

# view how many hits
sus$response$meta$hits
## [1] 645

This only returns the first 10 results, but I would like all 645 results in a data frame. This is where the jsonlite documentation came in handy: it explains paging with jsonlite and how to combine pages of data. The following loop was adapted directly from that documentation.

Combine all pages and create dataframe

In order to combine all the requests for 2021, I first need to know how many hits there were for the search term:

hits <- sus$response$meta$hits
cat("There were ",hits," hits for the search term Sustainability during 2021",sep = "")
## There were 645 hits for the search term Sustainability during 2021

At 10 search results per page, let’s divide the hit count by 10, rounding up, to get the number of pages. Pages are zero-indexed, so subtract 1: with 645 hits, ceiling(645 / 10) - 1 = 64, i.e. pages 0 through 64.

max_pages <- ceiling(hits / 10) - 1

Ok, now we can implement the loop specified in the jsonlite package documentation.

# store all pages in list
pages <- list()
for(i in 0:max_pages){
  sus_df <- fromJSON(paste0(complete_url, "&page=", i)) %>% 
    data.frame()
  message("Retrieving page ", i)
  pages[[i+1]] <- sus_df
  Sys.sleep(6) # pause between requests to stay within the API's per-minute rate limit
}
# combine into one
sus_df <- rbind_pages(pages)

# check that the row count matches the hit count
nrow(sus_df)
## [1] 645
# save df
saveRDS(sus_df, file = "sustainability_climate_change_nyc_2021.RDS")

Reload the data frame from disk so that I don’t have to make the API calls every time I want to work with the data.

sus_df <- readRDS("sustainability_climate_change_nyc_2021.RDS")
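As an optional refinement (not part of the workflow above, just a sketch), the save and reload steps could be wrapped in a file-existence check so the paging loop only needs to run when no cached copy exists:

# write the cache only on the first successful run; afterwards just read it back
rds_file <- "sustainability_climate_change_nyc_2021.RDS"
if (!file.exists(rds_file)) {
  saveRDS(sus_df, file = rds_file)
}
sus_df <- readRDS(rds_file)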

Rename column headers

Most of the column headers are prefixed with response.docs. or response.meta.; let’s clean that up.

colnames(sus_df)
##  [1] "status"                         "copyright"                     
##  [3] "response.docs.abstract"         "response.docs.web_url"         
##  [5] "response.docs.snippet"          "response.docs.lead_paragraph"  
##  [7] "response.docs.print_section"    "response.docs.print_page"      
##  [9] "response.docs.source"           "response.docs.multimedia"      
## [11] "response.docs.headline"         "response.docs.keywords"        
## [13] "response.docs.pub_date"         "response.docs.document_type"   
## [15] "response.docs.news_desk"        "response.docs.section_name"    
## [17] "response.docs.subsection_name"  "response.docs.byline"          
## [19] "response.docs.type_of_material" "response.docs._id"             
## [21] "response.docs.word_count"       "response.docs.uri"             
## [23] "response.meta.hits"             "response.meta.offset"          
## [25] "response.meta.time"
colnames(sus_df) <- sus_df %>% 
  colnames() %>% 
  str_replace_all(c(
    "response\\.docs\\._" = "", 
    "response\\.docs\\." = "",
    "response\\.meta\\." = ""))

# preview results
colnames(sus_df)
##  [1] "status"           "copyright"        "abstract"         "web_url"         
##  [5] "snippet"          "lead_paragraph"   "print_section"    "print_page"      
##  [9] "source"           "multimedia"       "headline"         "keywords"        
## [13] "pub_date"         "document_type"    "news_desk"        "section_name"    
## [17] "subsection_name"  "byline"           "type_of_material" "id"              
## [21] "word_count"       "uri"              "hits"             "offset"          
## [25] "time"

Some Visuals

Let’s take a look at the distribution of articles with keywords Sustainability and Climate Change in NYC for 2021.

sus_df %>% 
  group_by(month = month(pub_date, label = T)) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(month, count, group = 1, color = count)) +
    geom_line() +
    labs(y = "Article Count", x = "",
         title = "NYT Articles containing Sustainability and Climate Change in 2021",
         color = "")

Interestingly, almost twice as many articles were published in October as in January. I wonder why? Perhaps this has to do with more extreme weather events during that period.

Let’s take a look at the most popular sections in which they were published.

sus_df %>% 
  group_by(section_name) %>% 
  summarise(count = n()) %>%
  slice_max(count, n = 10) %>% 
  mutate(section_name=replace(section_name, section_name=="Today's International New York Times", "Today's Int NYT")) %>%
  ggplot(aes(reorder(section_name, count), count, fill = count)) +
    geom_bar(stat = "identity") +
    labs(x ="", y = "Article Count") +
    ggtitle("NYT Articles by Section containing Sustainability and Climate Change in 2021") +
    coord_flip() +
    scale_fill_continuous(name = "") +
    theme(plot.title = element_text(hjust = 0.65, vjust=1)) 

Interestingly, Business Day was the section with the most articles containing these search terms, not Climate, which came in second.

Let’s take a look at word count.

sus_df %>%
  ggplot(aes(word_count)) + 
    geom_histogram(binwidth = 150) +
    labs(y = "Article Count", x = "Article per word count",
         title = "Distribution of articles by word count")

Looks like the typical article runs around 1,500 words, with some extreme outliers upwards of 10k words. There are also ~25 articles with a word count of 0; perhaps they were visual pieces classified as articles.
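One quick way to check that guess (a sketch using the type_of_material and document_type columns, not run here) would be to tabulate what those zero-word-count rows actually are:

# what kind of material are the zero-word-count results?
sus_df %>% 
  filter(word_count == 0) %>% 
  count(type_of_material, document_type, sort = TRUE)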

I wonder what the median word count looks like broken down by the section in which an article was published.

(word_count_section <- sus_df %>% 
  group_by(section_name) %>% 
  summarise(count = n(),
            median_word_count = round(median(word_count)),
            mean_word_count = round(mean(word_count))) %>% 
  arrange(desc(count)))
## # A tibble: 30 × 4
##    section_name         count median_word_count mean_word_count
##    <chr>                <int>             <dbl>           <dbl>
##  1 Business Day           114              1452            1456
##  2 Climate                100              1174            1175
##  3 Opinion                 83              1294            1588
##  4 U.S.                    61              1141            1135
##  5 World                   50              1310            1401
##  6 Briefing                44              1326            1368
##  7 Style                   17               926            1007
##  8 Arts                    15              1227            1459
##  9 The Learning Network    15              1487            2160
## 10 Magazine                14              4348            4456
## # … with 20 more rows
word_count_section %>%
  slice_max(count, n = 10 ) %>% 
  ggplot(aes(reorder(section_name, median_word_count), median_word_count, fill = median_word_count)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    scale_fill_gradient2(high = scales::muted("blue"), mid = scales::muted("red")) +
    labs(x = "", y = "Median word count", title = "Median word count by Section Name (Top 10)") +
    theme(legend.position = "none") +
    geom_text(aes(label = median_word_count, 
                    hjust = 1.2),
                    color = "#ffffff")

Conclusion & Next Steps

Looks like the NYT API could be very useful for future data collection. For 2021 at least, articles containing the key words climate change and sustainability were most prevalent in October; it would be interesting to see when they peak over a series of years. Articles were also most prevalent in the Business section, even more than in the Climate section. However, a full analysis would need to account for the very real possibility that the Business section simply publishes more articles overall than the Climate section, and would therefore have more opportunities to publish articles containing these key words. Articles average around 1,500 words, and Podcasts and Magazine have by far the highest median word counts. It would be interesting to build some word clouds with facet mapping to compare the most popular words by section name, as sketched below.
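As a rough starting point for that last idea, the sketch below assumes the tidytext and ggwordcloud packages, uses the abstract column as the text source, and arbitrarily keeps only the four largest sections; none of it is part of the analysis above.

library(tidytext)
library(ggwordcloud)

# keep the four largest sections so the facets stay readable (arbitrary cutoff)
top_sections <- sus_df %>% 
  count(section_name, name = "total", sort = TRUE) %>% 
  slice_max(total, n = 4) %>% 
  pull(section_name)

# most common non-stop words in article abstracts, faceted by section
sus_df %>% 
  filter(section_name %in% top_sections) %>% 
  select(section_name, abstract) %>% 
  unnest_tokens(word, abstract) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(section_name, word, sort = TRUE) %>% 
  group_by(section_name) %>% 
  slice_max(n, n = 15, with_ties = FALSE) %>% 
  ungroup() %>% 
  ggplot(aes(label = word, size = n)) +
    geom_text_wordcloud() +
    scale_size_area(max_size = 10) +
    facet_wrap(~ section_name)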