DATA607 Week 9

Overview - Web APIs

The New York Times website provides a rich set of APIs, including the TimesTags API. For this assignment, I have chosen to work with this API, which allows you to mine the New York Times tag set. The response to a query is a ranked list of terms.

I will read in the JSON data from this API for a couple of different queries and store the data in R dataframes.

Request structure: ?query={search-string}&[optional-param1=value1]&[...]&api-key={your-api-key}

The searchable tag dictionaries [&filter={dictionary}] include:
(Des) - descriptive terms
(Geo) - geographical unit
(Org) - organizations
(Per) - personal names
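The request structure above can be sketched as a small helper that assembles the query string from its parts. This is a minimal illustration in base R and not part of the Times Tags documentation: `build_tags_url` and its argument names are hypothetical, and `YOUR-API-KEY` is a placeholder for a real key.

```r
# Hypothetical helper: assemble a Times Tags request URL from its parameters.
# The base URL is the one used in the examples in this document.
build_tags_url <- function(query, filter = NULL, max = NULL,
                           api_key = "YOUR-API-KEY") {
  base <- "http://api.nytimes.com/svc/suggest/v1/timestags"
  params <- list(query = query, filter = filter, max = max, `api-key` = api_key)
  # Drop optional parameters that were not supplied
  params <- params[!vapply(params, is.null, logical(1))]
  # Percent-encode each value so characters such as "(" or spaces are URL-safe
  values <- vapply(params,
                   function(v) utils::URLencode(as.character(v), reserved = TRUE),
                   character(1))
  paste0(base, "?", paste(names(params), values, sep = "=", collapse = "&"))
}

# Geographic tags matching "france", limited to 10 results
url_geo <- build_tags_url("france", filter = "(Geo)", max = 10)
print(url_geo)

# Optional parameters can be left out entirely
url_all <- build_tags_url("pal")
print(url_all)
```

With `reserved = TRUE`, `URLencode` percent-encodes the parentheses in the filter value, so the printed URL will not look character-for-character like the hand-written ones later in this document.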

HTTR

Using the httr library, I was able to test one of the examples from the Times Tags documentation. This example is a quick query across all dictionaries for the letters ‘pal’.

I requested an API key for this API; it is used throughout the code.

#load required package
library(httr)
library(knitr)

library(kableExtra)


exampleurl='http://api.nytimes.com/svc/suggest/v1/timestags?query=pal&api-key=7178bcfcb8b24ba3bdf1d837327dfd79'
pal <- GET(exampleurl)

#check the status to be sure the call worked
pal$status_code

## [1] 200

#status is 200 which means it worked

#view the content
kable(content(pal, "parsed")) %>% kable_styling("striped", full_width = F)

x
["pal",["Palestinians (Des)","Palestine Liberation Organization (Org)","Paleontology (Des)","Palestinian Authority (Org)","Palin, Sarah (Per)","Cerebral Palsy (Des)","Palm Beach (Fla) (Geo)","Palaces and Castles (Des)","Palmer, Arnold (Per)","Paltrow, Gwyneth (Per)"]]

Queries from the Times Tags API

I wanted to get additional data from the API and test out the different dictionaries and parameters.

#here is the base url for the Times Tags API with my API key and a placeholder for the query text ('%s')
baseurl = "http://api.nytimes.com/svc/suggest/v1/timestags%s&api-key=7178bcfcb8b24ba3bdf1d837327dfd79"

#Using sprintf, I am able to paste the query into the baseurl in place of %s

#This query searches for personal names including "data" and limits the results to 20.
data_per = sprintf(baseurl, "?query=data&filter=(Per)&max=20")

#This query searches all dictionaries for "data" and limits the results to 20.
data_all = sprintf(baseurl, "?query=data&max=20")

#Additional examples
france = sprintf(baseurl, "?query=france&filter=(Geo)&max=10")
pres_des= sprintf(baseurl, "?query=pres&filter=(Des)&max=10")
soc_des= sprintf(baseurl, "?query=soc&filter=(Des)&max=10")
hea_org= sprintf(baseurl, "?query=hea&filter=(Org)")
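
Because `sprintf` is vectorized, the same template can produce several query URLs in one call. The sketch below assumes a placeholder key (`YOUR-API-KEY`), and the variable names `template`, `filters`, `queries`, and `urls` are illustrative only.

```r
# Template with a placeholder key; %s marks where the query string goes
template <- "http://api.nytimes.com/svc/suggest/v1/timestags%s&api-key=YOUR-API-KEY"

# One query string per dictionary filter
filters <- c("(Des)", "(Geo)", "(Org)", "(Per)")
queries <- sprintf("?query=data&filter=%s&max=20", filters)

# sprintf recycles the single template over the vector of query strings
urls <- sprintf(template, queries)
names(urls) <- filters
print(urls)
```

Naming the vector by filter makes it easy to pick out one URL later, for example `urls[["(Per)"]]`.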

Transform into an R dataframe

The data from the ‘pal’ example is not easy to work with in that form, so I used the jsonlite package on some of the new queries.

#load required package
library(jsonlite)

#fromJSON turns the JSON into an R list. The search term is in the first element of the list and the results are in the second element.
kable(fromJSON(data_all))%>%kable_styling("striped", full_width = F)

x
data

x
Data-Mining and Database Marketing (Des)
Falsification of Data (Des)
Data Storage (Des)
Data Centers (Des)
Dataclysm: Who We Are When We Think No One’s Looking (Book) (Ttl)
Big Data (Movie) (Ttl)
Data and Goliath: The Hidden Battle to Collect Your Data and Control Your World (Book) (Ttl)
Data Call Technologies Inc. (Org)
Data for the People: How to Make Our Post-Privacy Economy Work for You (Book) (Ttl)
Data I/O Corporation (Org)
Data Storage Corporation (Org)
Data-ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else (Book) (Ttl)
Datamill Media Corporation (Org)
Datasea Inc. (Org)
Datawatch Corporation (Org)
Exploding Data: Reclaiming Our Cyber Security in the Digital Age (Book) (Ttl)
Habeas Data: Privacy Vs. the Rise of Surveillance Tech (Book) (Ttl)
Keeping Track: Personal Informatics, Self-Regulation and the Data-Driven Life (Book) (Ttl)
Our Bodies, Our Data: How Companies Make Billions Selling Our Medical Records (Book) (Ttl)
Small Wars, Big Data: The Information Revolution in Modern Conflict (Book) (Ttl)
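
To make the list structure concrete without calling the API, the same parsing step can be tried on a literal JSON string shaped like the response. The string below is a shortened, hand-written sample built from tags shown in the output above, not a live response, and the variable names are illustrative.

```r
library(jsonlite)

# Hand-written sample with the same shape as a Times Tags response:
# [ search term, [ matching tags ] ]
sample_json <- '["data",["Data Storage (Des)","Data Centers (Des)","Datawatch Corporation (Org)"]]'

parsed <- fromJSON(sample_json)

search_term <- parsed[[1]]  # the query text
tags <- parsed[[2]]         # character vector of matching tags

# Inspect the structure: a list holding a string and a character vector
str(parsed)
```

Since the two elements have different lengths, jsonlite keeps them as separate list elements, which is why the results are pulled out with `[[2]]`.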

#create a dataframe from the second element in the list 
data <- data.frame(fromJSON(data_all)[[2]])
names(data) <- "Tags_incl_'data'"
kable(head(data))%>%kable_styling("striped", full_width = F)

Tags_incl_‘data’
Data-Mining and Database Marketing (Des)
Falsification of Data (Des)
Data Storage (Des)
Data Centers (Des)
Dataclysm: Who We Are When We Think No One’s Looking (Book) (Ttl)
Big Data (Movie) (Ttl)
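
Each tag ends with its dictionary code in parentheses, so the name and the dictionary can be separated into their own columns. This is a possible follow-on step sketched in base R, not something the original analysis did. The sample tags are copied from outputs in this document, and the column names are assumptions.

```r
# Sample tags copied from outputs shown in this document
tags <- c("Data Storage (Des)",
          "Big Data (Movie) (Ttl)",
          "Datawatch Corporation (Org)",
          "Palm Beach (Fla) (Geo)")

# The dictionary code is the final parenthesised token
dictionary <- sub("^.*\\(([A-Za-z]+)\\)$", "\\1", tags)
# The tag name is everything before that final token
tag_name <- sub("[[:space:]]*\\([A-Za-z]+\\)$", "", tags)

tag_df <- data.frame(tag = tag_name, dictionary = dictionary,
                     stringsAsFactors = FALSE)
print(tag_df)

# Count how many tags come from each dictionary
print(table(tag_df$dictionary))
```

Anchoring the pattern at the end of the string keeps earlier parentheses, such as "(Movie)" or "(Fla)", as part of the tag name.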

Creating another R dataframe from query results from the API

Here I nest the function calls for more compact code. I create a dataframe of the top 200 organization tags returned for the query ‘llc’, ranked by how frequently they are used in the New York Times.

top200 <- data.frame(fromJSON(sprintf(baseurl, "?query=llc&filter=(Org)&max=200"))[[2]])
names(top200) <- "Top200 LLC's"
kable(head(top200))%>%kable_styling("striped", full_width = F)

Top200 LLC’s
United States Steel Corporation (Org)
KPMG (Org)
GMAC LLC (Org)
Fortress Investment Group L.L.C (Org)
Breitbart News Network LLC (Org)
One Grand LLC (Org)
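
If the same list-to-dataframe conversion is needed for several queries, the nested call can be wrapped in a function. A minimal sketch follows: `tags_to_df` is a hypothetical helper, and `parsed_example` stands in for the result of `fromJSON()` so the code runs without calling the API.

```r
# Hypothetical helper: turn a parsed Times Tags response
# (a list of: search term, character vector of tags) into a one-column data frame
tags_to_df <- function(parsed, col_name = "tag") {
  stopifnot(is.list(parsed), length(parsed) >= 2)
  df <- data.frame(parsed[[2]], stringsAsFactors = FALSE)
  names(df) <- col_name
  df
}

# Stand-in for the result of fromJSON(); values copied from the output above
parsed_example <- list("llc", c("GMAC LLC (Org)", "One Grand LLC (Org)"))
llc_df <- tags_to_df(parsed_example, "LLC tags")
print(llc_df)
```

With a live query, the first argument would be the list returned by `fromJSON()` for the request URL.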