DATA 607 Week 9 Assignment: Web APIs

Data Retrieval and Analysis

Quick Test

Retrieve search results using no parameters other than the search query term.

# Set up URL to access NYT API
api_key <- "&api-key=615feb0a12634c4789194e29202022ee"
search_criteria <- "weehawken"
main_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q="

# Retrieve data from NYT
request <- getURL(paste0(main_url, search_criteria, api_key))

# Parse article data
articles <- request %>% 
  enter_object("response") %>%
  enter_object("docs") %>% 
  gather_array("document.id") %>%
  spread_values(pubdate = jstring("pub_date"), 
                snippet = jstring("snippet"),
                url = jstring("web_url"))

# Parse headline data
headlines <- request %>% 
  enter_object("response") %>%
  enter_object("docs") %>% 
  gather_array("document.id") %>%
  enter_object("headline") %>% 
  spread_values(mainheadline = jstring("main"))

# Format for display
articles <- articles %>% 
  left_join(headlines, by = "document.id") %>%
  transform(Publication.Date = format(as.Date(pubdate), format="%B %d, %Y")) %>% 
  select(Publication.Date,
         Main.Headline = mainheadline, 
         Snippet = snippet,
         URL = url)

Publication.Date	Main.Headline	Snippet	URL
January 11, 2017	Living in Weehawken, N.J.	The recent development on the waterfront , with its striking views and appealing prices, is attracting those who work in New York City….	https://www.nytimes.com/slideshow/2017/01/11/realestate/living-in-weehawken-nj.html
January 11, 2017	Weehawken, N.J.: A Cliffside Town With an Easy Commute	Recent development on the waterfront with striking views and appealing prices is attracting those who work in New York City….	https://www.nytimes.com/2017/01/11/realestate/weehawken-nj-a-cliffside-town-with-an-easy-commute.html
August 09, 2015	Restaurant Review: Charritos in Weehawken Is a Family Affair	At Charritos in Weehawken, several aunts and the boss’s mother help make this spot ideal for enjoying Oaxacan specialties….	https://www.nytimes.com/2015/08/09/nyregion/restaurant-review-charritos-in-weehawken-is-a-family-affair.html
April 10, 1876	LETTERS TO THE EDITOR.; TILDEN’S GRAB. ALEXANDER HAMILTON.	NA	https://query.nytimes.com/gst/abstract.html?res=9F06E2D7143AE63BBC4852DFB266838D669FDE
August 20, 2012	Tom Cotter, President of the Group Bringing Formula One to New Jersey, Resigns	The marketing executive said he would return to manage his various motorsports-related businesses in North Carolina….	https://wheels.blogs.nytimes.com/2012/08/20/tom-cotter-president-of-the-group-bringing-formula-one-to-new-jersey-resigns/
April 01, 2010	UBS Brokerage Has No Plans For Split	UBS Wealth Management Americas will not change its name back to PaineWebber, nor will it split off from the embattled Swiss banking giant, the unit’s top executive told UBS brokers on Wednesday, Reuters reported….	https://dealbook.nytimes.com/2010/04/01/ubs-brokerage-has-no-plans-for-split/
March 06, 2014	Lin-Manuel Miranda’s ‘Hamilton’ Heading to Public Theater	The new musical “Hamilton,” by the creator of “In the Heights,” is having its world premiere at the Public Theater next winter….	https://artsbeat.blogs.nytimes.com/2014/03/06/lin-manuel-mirandas-hamilton-heading-to-public-theater/
April 10, 2013	New Jersey Condo Market Heats Up as Demand Surges	The Henley-on-the-Hudson project in Weehawken is part of New Jersey’s improving condominium market….	https://www.nytimes.com/2013/04/10/realestate/commercial/new-jersey-condo-market-heats-up.html
August 06, 2011	Killing Jeff Davis	A crack Union regiment hits its mark….	https://opinionator.blogs.nytimes.com/2011/08/06/killing-jeff-davis/
September 13, 2009	These Apartment Hunters Are the Happy Renters	After living in a condominium in Weehawken, N.J., Leslie and Andy LeCount regretted buying it….	https://www.nytimes.com/2009/09/13/realestate/13HUNT.html

I started with the jsonlite package, but once I realized that headline, keywords, and other fields are nested objects, it was easier to switch to tidyjson. Gathering the necessary data directly from the JSON structure seems easier than manipulating the nested data frames generated by fromJSON.
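To illustrate the nesting issue, here is a minimal sketch using a made-up two-document response shaped like the Article Search payload. The JSON literal and all of its values are invented for illustration, and the sketch assumes the jsonlite package is installed. It is not the approach used in this report, only a way to see what fromJSON returns.

```r
library(jsonlite)

# Hypothetical sample shaped like the API response (values are invented)
sample_json <- '{"response": {"docs": [
  {"pub_date": "2017-01-11T00:00:00Z",
   "headline": {"main": "First headline"},
   "keywords": [{"value": "Weehawken (NJ)"}, {"value": "Real Estate"}]},
  {"pub_date": "2015-08-09T00:00:00Z",
   "headline": {"main": "Second headline"},
   "keywords": [{"value": "Restaurants"}]}
]}}'

# Without flattening, `headline` is a data frame nested inside a column
nested <- fromJSON(sample_json)$response$docs
print(is.data.frame(nested$headline))

# flatten = TRUE promotes nested objects to columns such as headline.main;
# array fields like `keywords` still come back as a list column
flat <- fromJSON(sample_json, flatten = TRUE)$response$docs
print(flat$headline.main)
print(class(flat$keywords))
```

Even after flattening, the keywords column still has to be unpacked per document, which is the step tidyjson's gather_array handles directly.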

Additional Analysis

Retrieve 10 pages (100 hits) and analyze the timeline of publication dates and the most commonly used keywords.

# Set fields to return
fields <- "&fl=pub_date,keywords"

# Set up empty data frames to gather results from individual pages
dates <- data.frame(document.id=integer(), pubdate=character())
keywords <- data.frame(document.id=integer(), keyword=character())

# Loop through 10 pages
for(i in 0:9) {
  # Set page number (main_url already contains "?q=", so join with "&")
  page <- paste0("&page=", i)
  
  # Retrieve data from NYT
  request <- getURL(paste0(main_url, search_criteria, page, fields, api_key))

  # Parse dates data
  dates_i <- request %>% 
    enter_object("response") %>%
    enter_object("docs") %>% 
    gather_array("document.id") %>%
    spread_values(pubdate = jstring("pub_date")) %>% 
    mutate(document.id = document.id + i*10,
         pubdate = as.Date(pubdate))
  dates <- rbind(dates, dates_i)

  # Parse keyword data
  keywords_i <- request %>% 
    enter_object("response") %>%
    enter_object("docs") %>% 
    gather_array("document.id") %>%
    enter_object("keywords") %>% 
    gather_array() %>% 
    spread_values(keyword = jstring("value")) %>% 
    transform(document.id = document.id + i*10) %>% 
    select(document.id, keyword)
  keywords <- rbind(keywords, keywords_i)
  
  # It seems that quick consecutive requests are not handled properly
  # Insert pause between each request
  # Since only looking at 10 pages, a 2 second delay is not too costly
  Sys.sleep(2)
}
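The request URL and the document numbering used in the loop can be sketched in isolation. The helper names build_url and global_id are hypothetical, and YOUR_KEY is a placeholder rather than a real key. The sketch assumes the endpoint string already ends in "?q=", so every later parameter is joined with "&".

```r
# Hypothetical helper: build a request URL where every parameter after the
# search term is joined with "&" (the endpoint already ends in "?q=")
build_url <- function(main_url, query, page, fields, api_key) {
  paste0(main_url,
         URLencode(query, reserved = TRUE),  # escape spaces and symbols in the term
         "&page=", page,
         fields,
         api_key)
}

main_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q="
url <- build_url(main_url, "weehawken", 3, "&fl=pub_date,keywords", "&api-key=YOUR_KEY")
print(url)

# Pages are 0-based and hold 10 documents each, so the j-th document on
# page i gets the global id i * 10 + j
global_id <- function(page, index_on_page) page * 10 + index_on_page
print(global_id(3, 1:10))
```

Encoding the search term matters only for queries with spaces or reserved characters; for a single word like "weehawken" it leaves the string unchanged.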

Find and plot the 20 most common keywords.

# Get a list of top 20 keywords
top_keywords <- keywords %>% 
  group_by(keyword) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) %>% 
  top_n(20, count)  # rank by count explicitly; ties may return more than 20 rows

# Plot top keywords
ggplot(data = top_keywords, aes(x = keyword, y = count)) + 
  geom_bar(stat="identity") +
  coord_flip() +
  labs(x = "", y = "")

The list includes references to railroads, ferries, and tunnels. This is not surprising for Weehawken, which is home to the Lincoln Tunnel, one of two tunnels connecting NJ and New York City. It was also a terminal destination for railroads: from here goods could be loaded onto barges for delivery to New York City or, in the opposite direction, unloaded from barges for shipment to the rest of the country. The list also includes several non-descriptive keywords such as United States, NJ, New York City, and Weehawken.
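One possible refinement, not applied in this report, is to drop the non-descriptive keywords before ranking. The sketch below uses an invented sample and a hypothetical stop list; the exact keyword spellings returned by the API may differ.

```r
# Invented sample of keyword rows (one row per keyword occurrence)
keywords_sample <- data.frame(
  document.id = c(1, 1, 2, 2, 3, 3),
  keyword = c("WEEHAWKEN (NJ)", "FERRIES", "NEW YORK CITY", "FERRIES",
              "UNITED STATES", "RAILROADS"),
  stringsAsFactors = FALSE
)

# Hypothetical stop list; actual spellings depend on what the API returns
generic <- c("WEEHAWKEN (NJ)", "NEW YORK CITY", "UNITED STATES", "NEW JERSEY")

# Drop generic place names, then count what is left
descriptive <- keywords_sample[!(toupper(keywords_sample$keyword) %in% generic), ]
counts <- sort(table(descriptive$keyword), decreasing = TRUE)
print(counts)
```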

Plot number of articles by publication month.

# Prepare dates for plotting
dates2 <- dates %>% 
  transform(month = substr(pubdate, 1, 7)) %>% 
  group_by(month) %>% 
  summarise(count = n()) %>% 
  select(month, count)

# Plot monthly article counts
ggplot(data = dates2, aes(x = month, y = count, group = 1)) + 
  geom_col() +
  coord_flip() + 
  labs(x = "", y = "")

Please note that the months are not spaced proportionally to the timeline.
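The uneven spacing comes from treating the month as a character label. As a rough sketch with invented counts, converting the "YYYY-MM" strings to Date values makes the gaps between months measurable; mapping such a Date column to the x axis in ggplot2 would then produce a continuous date scale. The column name month_start is an assumption made for this example.

```r
# Invented monthly counts in the same "YYYY-MM" form produced by substr()
monthly <- data.frame(
  month = c("1937-12", "1946-11", "2017-01"),
  count = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Append a day so the string parses as a Date
monthly$month_start <- as.Date(paste0(monthly$month, "-01"))

# Gaps between consecutive months, in days; a date axis preserves these
gaps <- diff(monthly$month_start)
print(gaps)
```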

Something clearly happened in December 1937. Looking into it, I learned that the Lincoln Tunnel, connecting Weehawken and New York City, was opened on December 21, 1937. Below is the opening ceremony on the New York City side.

Weehawken is not associated in my mind with coal, so I decided to take a look at why it comes up so often in keywords. Break down the top keywords by publication month, accounting for the number of articles in each.

# Isolate publication month and join it with top 20 keywords
keywords_by_date <- dates %>% 
  transform(month = substr(pubdate, 1, 7)) %>% 
  select(document.id, month) %>% 
  inner_join(keywords, by = "document.id") %>% 
  group_by(keyword, month) %>% 
  summarise(count = n()) %>% 
  inner_join(top_keywords, by = "keyword") %>% 
  select(keyword, month, count = count.x)

# Plot keywords by publication month and factor in number of articles
ggplot(data = keywords_by_date, aes(x = month, y = keyword)) + 
  geom_point(aes(size = count)) +
  labs(x = "", y = "") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

It looks like all articles mentioning coal were published in November 1946. Looking into it further, I found that coal was mentioned in only a few articles (nothing particularly interesting), but it was listed several times per article, which inflated the keyword count.
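The inflation from repeated keywords can be shown with a small invented sample: counting raw occurrences versus counting distinct articles per keyword. This is only a sketch of one way to de-duplicate, not a step taken in the analysis above.

```r
# Invented sample: article 1 lists "COAL" three times
kw <- data.frame(
  document.id = c(1, 1, 1, 2, 2, 3),
  keyword = c("COAL", "COAL", "COAL", "COAL", "FERRIES", "FERRIES"),
  stringsAsFactors = FALSE
)

# Raw occurrences: every repeated listing counts
raw_counts <- table(kw$keyword)

# Distinct articles: drop duplicate (document.id, keyword) pairs first
article_counts <- table(unique(kw)$keyword)

print(raw_counts)
print(article_counts)
```

With the duplicates removed, both keywords appear in the same number of articles even though the raw count for "COAL" is twice as large.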

Ilya Kats

April 1, 2017