DATA607 Assignment 6: Web APIs

Intro
Load JSON Data into Data Frame
Preview Resulting Data Table
Tidy & Transform Data
Preview Transformed Data
Visualization: Top 20 Email Shares by Keyword
Conclusion

Intro

We were tasked with choosing one of the New York Times APIs, constructing an interface in R to read in the JSON data, and transforming it into an R data frame.

I decided to use the Most Popular API. This API returns a list of New York Times articles based on shares, emails, and views.

Load JSON Data into Data Frame

I will use the NYT API to get the articles most frequently shared by email over the last 30 days.

The NYT API returns a maximum of 20 articles per request.

We must paginate these requests using the offset parameter.

We must also ensure that we do not exceed the rate limit of 5 requests per second.
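
As a rough sketch of the pagination arithmetic described above, the helper below lists the offset values needed after the first request. The name `page_offsets` and its `page_size` argument are made up for this illustration and are not part of the NYT API; the sketch assumes that offsets count articles and that the first request (offset 0) has already been made.

```r
# Hypothetical helper (not part of any package): offsets for the requests
# that follow the initial one, assuming 20 articles per page.
page_offsets <- function(num_results, page_size = 20) {
  if (num_results <= page_size) {
    return(numeric(0)) # everything fit in the first request
  }
  seq(page_size, num_results - 1, by = page_size)
}

page_offsets(45) # offsets 20 and 40
page_offsets(20) # no further requests needed

# A limit of 5 requests per second implies at least 1/5 s between calls
min_delay <- 1 / 5
```

Under these assumptions, a pause of one second between requests stays well within the limit.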

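# load JSON data into data frame
library(jsonlite) # fromJSON()
library(dplyr)    # %>%, bind_rows()
library(tibble)   # rowid_to_column()

# api_key is assumed to already hold your NYT developer key
url <- paste0("https://api.nytimes.com/svc/mostpopular/v2/mostemailed/all-sections/30.json?api-key=", api_key)
articles <- fromJSON(url, flatten = TRUE) %>% data.frame()
# number of additional pages after the first request (20 articles per page)
pages <- ceiling(articles$num_results[1] / 20) - 1

for (i in seq_len(pages)) {
  print(paste0("Loading page ", i, " of ", pages, "."))
  page <- fromJSON(paste0(url, "&offset=", i * 20), flatten = TRUE) %>% data.frame()
  articles <- bind_rows(articles, page) # build data frame incrementally
  Sys.sleep(1) # stay within usage rate limits
}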

## [1] "Loading page 1 of 62."
## [1] "Loading page 2 of 62."
## [1] "Loading page 3 of 62."
## ... (output for pages 4 through 61 omitted)
## [1] "Loading page 62 of 62."

# add a row id so the transformation can carry a compact key instead of every column
articles <- rowid_to_column(articles,'id')

Preview Resulting Data Table

print(names(articles))

##  [1] "id"                     "status"                
##  [3] "copyright"              "num_results"           
##  [5] "results.url"            "results.count_type"    
##  [7] "results.column"         "results.section"       
##  [9] "results.byline"         "results.title"         
## [11] "results.abstract"       "results.published_date"
## [13] "results.source"         "results.des_facet"     
## [15] "results.org_facet"      "results.per_facet"     
## [17] "results.geo_facet"      "results.media"
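
The `results.` prefix on most column names comes from the nested structure of the response: `fromJSON()` parses the `results` array into its own data frame, and `data.frame()` then prefixes those columns. A minimal sketch with a made-up JSON string follows; the field values are invented and only mimic the shape of the real response.

```r
library(jsonlite)

# Invented sample that mimics the shape of the API response
sample_json <- '{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "First article",  "des_facet": ["Tag A", "Tag B"]},
    {"title": "Second article", "des_facet": ["Tag C"]}
  ]
}'

parsed <- fromJSON(sample_json, flatten = TRUE)
str(parsed$results$des_facet) # list column: one character vector per article

# Top-level fields are recycled; nested columns gain the "results." prefix
sample_df <- data.frame(parsed)
names(sample_df)
```

Each element of `des_facet` is a character vector of varying length, which is why the tags need extra handling before they can be counted.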

library(DT) # datatable()
datatable(head(select(articles, results.title, results.source, results.des_facet, results.published_date)), options = list(filter = FALSE))

Tidy & Transform Data

I would like to find out which tags/keywords account for the highest proportion of shares by email. The descriptive tags for each article are stored in results.des_facet as a character vector with multiple elements. The keywords need to be separated and then placed in a long format, one row per article and keyword, to facilitate downstream analysis.

library(splitstackshape) # cSplit()

keywords <- select(articles, id, results.des_facet) %>%
  group_by(id) %>%
  summarize(keyword = paste(unlist(results.des_facet), collapse = ",")) %>%
  cSplit("keyword", sep = ",", direction = "long")
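
One caveat with collapsing on commas is that any tag containing a comma of its own would be split into pieces. As a possible alternative, the list column can be expanded directly in base R. The toy data frame and tag names below are invented for illustration and are not taken from the API.

```r
# Toy data with a list column shaped like results.des_facet (values invented)
toy <- data.frame(id = 1:2)
toy$des_facet <- list(c("Tag A", "Tag B"), "Tag C, with a comma")

# Repeat each id once per tag, then flatten the tags into one vector
keywords_long <- data.frame(
  id = rep(toy$id, lengths(toy$des_facet)),
  keyword = unlist(toy$des_facet),
  stringsAsFactors = FALSE
)
keywords_long
```

The tag with the embedded comma stays in a single row here. tidyr's `unnest()` offers a similar expansion inside a pipeline.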

Preview Transformed Data

datatable(keywords, options = list(filter = FALSE))

Visualization: Top 20 Email Shares by Keyword

library(ggplot2)

group_by(keywords, keyword) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20, count) %>%
  ggplot(aes(x = reorder(keyword, count), y = count, fill = keyword, label = count)) +
  geom_col(show.legend = FALSE) + # bars drawn from the precomputed counts
  geom_text(size = 2, position = position_stack(vjust = 0.5)) +
  coord_flip() +
  labs(title = "Top 20 Email Shares by Keyword", x = "Keywords/Tags", y = "Frequency") +
  theme(plot.title = element_text(hjust = 2))
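
Because the stated goal is the proportion of email shares per keyword, the counts can also be expressed as shares of all tag occurrences. The following is a small base R sketch on an invented keyword vector, not the real data.

```r
# Invented keyword vector standing in for keywords$keyword
keyword <- c("Politics", "Politics", "Gun Control",
             "Education", "Politics", "Gun Control")

# Frequency per keyword, most frequent first
counts <- sort(table(keyword), decreasing = TRUE)

# Share of all tag occurrences, in percent
shares <- round(100 * prop.table(counts), 1)

top <- head(data.frame(
  keyword = names(counts),
  count = as.integer(counts),
  share_pct = as.numeric(shares),
  stringsAsFactors = FALSE
), 2)
top
```

Since an article can carry several tags, these shares describe tag occurrences rather than distinct articles.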

Conclusion

The results show that the NYT articles most frequently shared by email were dominated by politics. Three of the top 20 tags were related to gun violence. It would be great if more media houses provided such an API service so that we could compare and contrast trending feeds across various sources.