The objective of this assignment is to retrieve information using one of The New York Times APIs and store it in a data frame. I have chosen to work with the Article Search API. Since the NYT archive goes back to September 1851, I was curious to see what it holds about my new hometown of Weehawken, NJ (incorporated 1859).
library(RCurl)
library(tidyjson)
library(dplyr)
library(tidyr)
library(ggplot2)
Retrieve search results with no parameters other than the search query term.
# Set up URL to access NYT API
api_key <- "&api-key=615feb0a12634c4789194e29202022ee"
search_criteria <- "weehawken"
main_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q="
# Retrieve data from NYT
request <- getURL(paste0(main_url, search_criteria, api_key))
# Parse article data
articles <- request %>%
  enter_object("response") %>%
  enter_object("docs") %>%
  gather_array("document.id") %>%
  spread_values(pubdate = jstring("pub_date"),
                snippet = jstring("snippet"),
                url = jstring("web_url"))
# Parse headline data
headlines <- request %>%
  enter_object("response") %>%
  enter_object("docs") %>%
  gather_array("document.id") %>%
  enter_object("headline") %>%
  spread_values(mainheadline = jstring("main"))
# Format for display
articles <- articles %>%
  left_join(headlines, by = "document.id") %>%
  transform(Publication.Date = format(as.Date(pubdate), format = "%B %d, %Y")) %>%
  select(Publication.Date,
         Main.Headline = mainheadline,
         Snippet = snippet,
         URL = url)
| Publication.Date | Main.Headline | Snippet | URL |
|---|---|---|---|
| January 11, 2017 | Living in Weehawken, N.J. | The recent development on the waterfront , with its striking views and appealing prices, is attracting those who work in New York City…. | https://www.nytimes.com/slideshow/2017/01/11/realestate/living-in-weehawken-nj.html |
| January 11, 2017 | Weehawken, N.J.: A Cliffside Town With an Easy Commute | Recent development on the waterfront with striking views and appealing prices is attracting those who work in New York City…. | https://www.nytimes.com/2017/01/11/realestate/weehawken-nj-a-cliffside-town-with-an-easy-commute.html |
| August 09, 2015 | Restaurant Review: Charritos in Weehawken Is a Family Affair | At Charritos in Weehawken, several aunts and the bosss mother help make this spot ideal for enjoying Oaxacan specialties…. | https://www.nytimes.com/2015/08/09/nyregion/restaurant-review-charritos-in-weehawken-is-a-family-affair.html |
| April 10, 1876 | LETTERS TO THE EDITOR.; TILDEN’S GRAB. ALEXANDER HAMILTON. | NA | https://query.nytimes.com/gst/abstract.html?res=9F06E2D7143AE63BBC4852DFB266838D669FDE |
| August 20, 2012 | Tom Cotter, President of the Group Bringing Formula One to New Jersey, Resigns | The marketing executive said he would return to manage his various motorsports-related businesses in North Carolina…. | https://wheels.blogs.nytimes.com/2012/08/20/tom-cotter-president-of-the-group-bringing-formula-one-to-new-jersey-resigns/ |
| April 01, 2010 | UBS Brokerage Has No Plans For Split | UBS Wealth Management Americas will not change its name back to PaineWebber, nor will it split off from the embattled Swiss banking giant, the unit’s top executive told UBS brokers on Wednesday, Reuters reported…. | https://dealbook.nytimes.com/2010/04/01/ubs-brokerage-has-no-plans-for-split/ |
| March 06, 2014 | Lin-Manuel Miranda’s ‘Hamilton’ Heading to Public Theater | The new musical “Hamilton,” by the creator of “In the Heights,” is having its world premiere at the Public Theater next winter…. | https://artsbeat.blogs.nytimes.com/2014/03/06/lin-manuel-mirandas-hamilton-heading-to-public-theater/ |
| April 10, 2013 | New Jersey Condo Market Heats Up as Demand Surges | The Henley-on-the-Hudson project in Weehawken is part of New Jerseys improving condominium market…. | https://www.nytimes.com/2013/04/10/realestate/commercial/new-jersey-condo-market-heats-up.html |
| August 06, 2011 | Killing Jeff Davis | A crack Union regiment hits its mark…. | https://opinionator.blogs.nytimes.com/2011/08/06/killing-jeff-davis/ |
| September 13, 2009 | These Apartment Hunters Are the Happy Renters | After living in a condominium in Weehawken, N.J., Leslie and Andy LeCount regretted buying it…. | https://www.nytimes.com/2009/09/13/realestate/13HUNT.html |
I started with the jsonlite package, but once I realized that headline, keywords, and other fields are nested objects, it was easier to switch to tidyjson. Gathering the necessary data directly from the JSON structure seems easier than manipulating the nested data frames generated by fromJSON.
Retrieve 10 pages (100 hits) and analyze the timeline of publication dates and the most commonly used keywords.
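To make the nesting concrete, here is a minimal sketch of what fromJSON returns for a response of this shape. The JSON below is a toy string with made-up values, not real API output, and only the `pub_date` and `headline.main` fields are assumed.

```r
library(jsonlite)

# Toy JSON mimicking the shape of an Article Search response (made-up values)
json <- '{"response": {"docs": [
  {"pub_date": "2017-01-11", "headline": {"main": "First headline"}},
  {"pub_date": "2015-08-09", "headline": {"main": "Second headline"}}
]}}'

docs <- fromJSON(json)$response$docs

# "headline" is itself a data frame stored inside a column of docs
is.data.frame(docs$headline)

# Nested fields can be reached with chained $ ...
docs$headline$main

# ... or the nested columns can be flattened into names like "headline.main"
flat <- jsonlite::flatten(docs)
names(flat)
```

Flattening works for objects such as `headline`, but array-valued fields such as `keywords` would remain list columns, which is where walking the JSON with tidyjson felt more direct.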
# Set fields to return
fields <- "&fl=pub_date,keywords"
# Set up empty data frames to gather results from individual pages
dates <- data.frame(document.id=integer(), pubdate=character())
keywords <- data.frame(document.id=integer(), keyword=character())
# Loop through 10 pages
for (i in 0:9) {
  # Set page number; joined with "&" because main_url already opens
  # the query string with "?q="
  page <- paste0("&page=", i)
  # Retrieve data from NYT
  request <- getURL(paste0(main_url, search_criteria, page, fields, api_key))
  # Parse dates data
  dates_i <- request %>%
    enter_object("response") %>%
    enter_object("docs") %>%
    gather_array("document.id") %>%
    spread_values(pubdate = jstring("pub_date")) %>%
    mutate(document.id = document.id + i*10,
           pubdate = as.Date(pubdate))
  dates <- rbind(dates, dates_i)
  # Parse keyword data
  keywords_i <- request %>%
    enter_object("response") %>%
    enter_object("docs") %>%
    gather_array("document.id") %>%
    enter_object("keywords") %>%
    gather_array() %>%
    spread_values(keyword = jstring("value")) %>%
    transform(document.id = document.id + i*10) %>%
    select(document.id, keyword)
  keywords <- rbind(keywords, keywords_i)
  # It seems that quick consecutive requests are not handled properly
  # Insert pause between each request
  # Since only looking at 10 pages, a 2 second delay is not too costly
  Sys.sleep(2)
}
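Because the request URL is assembled from several string fragments, it is easy to get a separator wrong: only the first parameter follows `?`, and every later one must be joined with `&`. The following is a minimal sketch of building the URL in one place. The helper name `build_nyt_url` and the environment variable `NYT_API_KEY` are assumptions made for illustration; only the endpoint and the `q`, `page`, `fl`, and `api-key` parameters come from the code above.

```r
# Hypothetical helper: assemble an Article Search URL from its parts.
# Reading the key from an environment variable (assumed name NYT_API_KEY)
# keeps it out of the script.
build_nyt_url <- function(query, page = 0, fields = NULL,
                          api_key = Sys.getenv("NYT_API_KEY")) {
  base <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
  params <- c(
    q = utils::URLencode(query, reserved = TRUE),  # escape spaces etc.
    page = as.character(page),
    fl = if (!is.null(fields)) paste(fields, collapse = ","),  # dropped when NULL
    `api-key` = api_key
  )
  # "?" introduces the query string; "&" separates the parameters
  paste0(base, "?", paste0(names(params), "=", params, collapse = "&"))
}

url <- build_nyt_url("weehawken", page = 2,
                     fields = c("pub_date", "keywords"),
                     api_key = "DEMO_KEY")
url
```

Inside the loop, a call such as `getURL(build_nyt_url(search_criteria, page = i, fields = c("pub_date", "keywords")))` could then replace the manual `paste0`.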
Find and plot the 20 most common keywords.
# Get a list of top 20 keywords
top_keywords <- keywords %>%
  group_by(keyword) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20, count)
# Plot top keywords
ggplot(data = top_keywords, aes(x = keyword, y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "", y = "")
The list includes references to railroads, ferries, and tunnels. That is not surprising for Weehawken: it is home to the Lincoln Tunnel, one of two tunnels connecting NJ and New York City, and it was also a terminal for railroads. From here goods could be loaded onto barges for delivery to New York City or, in the opposite direction, unloaded from barges for shipment to the rest of the country. The list also includes several non-descriptive keywords such as United States, NJ, New York City, and Weehawken.
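One way to keep such generic place names out of the ranking is to drop them before counting. This is only a sketch in base R: the keyword values and the stoplist below are made up for illustration and are not the exact strings the API returns.

```r
# Toy keyword vector (made-up values), one entry per keyword occurrence
keyword <- c("Railroads", "Weehawken (NJ)", "Ferries", "Railroads",
             "New York City", "Tunnels", "Railroads", "Ferries")

# Assumed stoplist of place names that match nearly every article
too_generic <- c("Weehawken (NJ)", "New York City", "United States", "NJ")

# Count the remaining keywords and keep the most frequent ones
counts <- table(keyword[!keyword %in% too_generic])
top <- head(sort(counts, decreasing = TRUE), 3)
top
```

With the real data, the same filter could be applied to `keywords` with `filter(!keyword %in% too_generic)` before `group_by(keyword)`.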
Plot the number of articles by publication month.
# Prepare dates for plotting
dates2 <- dates %>%
  transform(month = substr(pubdate, 1, 7)) %>%
  group_by(month) %>%
  summarise(count = n()) %>%
  select(month, count)
# Plot monthly article counts
ggplot(data = dates2, aes(x = month, y = count, group = 1)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "")
Please note that the months are treated as text labels, so they are not spaced proportionally to the timeline.
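If proportional spacing were wanted, the "YYYY-MM" strings could be converted back to real dates before plotting. Here is a minimal sketch under that assumption, using a toy data frame with made-up counts in the same shape as `dates2`; the column name `month_start` and the bar width are choices made for illustration.

```r
# Toy monthly counts in the same shape as dates2 (made-up values)
monthly <- data.frame(month = c("1937-12", "1946-11", "2017-01"),
                      count = c(5L, 3L, 2L),
                      stringsAsFactors = FALSE)

# Turn "YYYY-MM" into a Date (first day of the month)
monthly$month_start <- as.Date(paste0(monthly$month, "-01"))

# Gaps between consecutive months now reflect elapsed time
gaps <- diff(monthly$month_start)

# Plot on a continuous date axis if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(monthly, aes(x = month_start, y = count)) +
    geom_col(width = 25) +   # width in days, roughly one month
    labs(x = "", y = "")
}
```

The trade-off is that a timeline spanning more than 150 years leaves individual months as very thin bars, which is one reason to keep the categorical axis used here.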
Something clearly happened in December 1937. Looking into it, I learned that the Lincoln Tunnel, connecting Weehawken and New York City, was opened on December 21, 1937. Below is the opening ceremony on the New York City side.
Weehawken is not associated in my mind with coal, so I decided to take a look at why it comes up so often in keywords. Break down the top keywords by publication month, accounting for the number of articles for each.
# Isolate publication month and join it with top 20 keywords
keywords_by_date <- dates %>%
  transform(month = substr(pubdate, 1, 7)) %>%
  select(document.id, month) %>%
  inner_join(keywords, by = "document.id") %>%
  group_by(keyword, month) %>%
  summarise(count = n()) %>%
  inner_join(top_keywords, by = "keyword") %>%
  select(keyword, month, count = count.x)
# Plot keywords by publication month and factor in number of articles
ggplot(data = keywords_by_date, aes(x = month, y = keyword)) +
  geom_point(aes(size = count)) +
  labs(x = "", y = "") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
It looks like all articles mentioning coal were published in November 1946. Looking into it further, it turned out that coal was mentioned in only a few articles (nothing particularly interesting), but it was listed several times per article, which inflated the keyword count.
I have practiced retrieving information using an API, including specifying query parameters, as well as manipulating data retrieved in JSON format. I have also learned that the Lincoln Tunnel is celebrating 80 years in 2017.
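The inflation can be avoided by counting distinct articles per keyword instead of raw keyword occurrences. The sketch below uses a toy table with made-up values in the same shape as `keywords` to show the difference between the two counts.

```r
# Toy keyword table in the same shape as `keywords` (made-up values)
kw <- data.frame(
  document.id = c(1, 1, 1, 2, 2, 3),
  keyword     = c("Coal", "Coal", "Coal", "Coal", "Ferries", "Ferries"),
  stringsAsFactors = FALSE
)

# Raw occurrences: repeated listings within one article inflate the count
occurrences <- table(kw$keyword)

# Distinct articles: drop repeated (document.id, keyword) pairs first
articles_per_keyword <- table(unique(kw)$keyword)

occurrences
articles_per_keyword
```

In the pipeline above, the equivalent change would be a `distinct(document.id, keyword)` step before `group_by(keyword)`.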