Problem Statement

Extracting content from diverse web resources, cleaning the raw data, preparing it for statistical analysis, and actually performing the analysis is far from a simple task. The New York Times website provides a rich set of APIs, as described here: https://developer.nytimes.com/api . The task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame.


Acceptance Criteria

Preparing Data

  • Choose one of the New York Times APIs and request an API key
  • Construct an interface in R to read in the JSON data
  • Transform the data into an R data frame

Reproducibility

  • Using R Markdown text and headers

Workflow

  • Included a brief description of the assigned problem.
  • Included an overview of your approach.
  • Explained your reasoning.
  • Provided a conclusion (including any findings and recommendations).

Approach


Implementation

Load required libraries

library(DT)
library(jsonlite)
library(tidyjson)
library(dplyr)
library(tidyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(tm)
library(textclean)
library(lares)

Load NY Times API credentials from secret management

baseurl<-get_creds()$`nyt.api`$baseurl
apikey <-get_creds()$`nyt.api`$apikey
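
For readers without a `get_creds()` setup, the same two values can be supplied in other ways. The following is a minimal sketch, not the approach used in this report: the environment variable name `NYT_API_KEY`, the helper names `read_api_key` and `build_url`, and the base URL are all assumptions made for illustration, and the base URL should be checked against the NYT developer documentation.

```r
# Hypothetical alternative to get_creds(): read the key from an environment
# variable. NYT_API_KEY is an assumed name, not one the API mandates.
read_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}

# Assumed base URL for the Top Stories API; verify against the NYT docs.
baseurl_example <- "https://api.nytimes.com/svc/topstories/v2/"

# Build the request URL for one section: <base><section>.json?api-key=<key>
build_url <- function(section, key, base = baseurl_example) {
  paste0(base, section, ".json?api-key=", key)
}
```

Keeping the key out of the source file this way means the R Markdown document can be shared without exposing the credential.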

Generic function to fetch the data from Web API

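get_data <- function(section) {
  # Build the request URL: <baseurl><section>.json?api-key=<apikey>
  url <- paste0(baseurl, section, ".json?api-key=", apikey)
  # fromJSON() downloads and parses the response; "results" holds the stories
  request <- fromJSON(URLencode(url))
  stories <- request$results
  # Keep only the fields used in the analysis
  newdata <- data.frame(Subsection = stories$subsection,
                        Title = stories$title,
                        Abstract = stories$abstract,
                        Byline = stories$byline,
                        Created = as.Date(stories$created_date),
                        "Short URL" = stories$short_url,
                        check.names = FALSE, # keep "Short URL" rather than Short.URL
                        stringsAsFactors = FALSE)
  return(newdata)
}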
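
To see what `get_data()` relies on without calling the API, the sketch below parses a small made-up payload. The JSON is only shaped like the fields the function reads (`results`, `title`, `created_date` and so on); its values are placeholders, not real API output.

```r
library(jsonlite)

# Made-up payload with the same field names get_data() reads.
sample_json <- '{
  "status": "OK",
  "results": [
    {"subsection": "", "title": "First story", "abstract": "An abstract.",
     "byline": "By A. Writer", "created_date": "2020-03-28T05:00:00-04:00",
     "short_url": "https://example.com/1"},
    {"subsection": "review", "title": "Second story", "abstract": "Another one.",
     "byline": "By B. Writer", "created_date": "2020-03-29T09:30:00-04:00",
     "short_url": "https://example.com/2"}
  ]
}'

parsed <- fromJSON(sample_json)

# jsonlite simplifies a JSON array of objects into a data frame by default
stories <- parsed$results

stories_df <- data.frame(Subsection = stories$subsection,
                         Title = stories$title,
                         Abstract = stories$abstract,
                         Byline = stories$byline,
                         # as.Date() keeps only the leading YYYY-MM-DD part
                         Created = as.Date(stories$created_date),
                         stringsAsFactors = FALSE)
str(stories_df)
```

Because `fromJSON()` already returns `results` as a data frame, the `data.frame()` call mainly selects and renames columns and converts the timestamp to a date.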

Generic function to plot a word cloud for given text

get_wordcloud <- function(dataframe) {
  abstract <- dataframe$Abstract
  words <- Corpus(VectorSource(abstract))
  # Normalize quotes and expand contractions before punctuation is stripped;
  # otherwise the apostrophes they rely on are already gone
  words <- words %>%
    tm_map(replace_curly_quote) %>%
    tm_map(replace_contraction) %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
  words <- tm_map(words, content_transformer(tolower))
  words <- tm_map(words, removeWords, stopwords("english"))
  # Term-document matrix: one row per term, one column per abstract
  dtm <- TermDocumentMatrix(words)
  term_matrix <- as.matrix(dtm)
  # Total count of each term across all abstracts, most frequent first
  freqs <- sort(rowSums(term_matrix), decreasing = TRUE)
  df <- data.frame(word = names(freqs), freq = freqs)
  # Drop the single most frequent term before plotting
  df <- df[-1, ]
  set.seed(1234) # for reproducibility
  wordcloud(words = df$word, freq = df$freq, min.freq = 1,
            max.words = 150, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}
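
The word cloud sizes each word by how often it occurs across the abstracts. The sketch below reproduces that counting step in base R on two invented sentences, as a simplified stand-in for the `tm` pipeline: the function name `count_terms` and its short stop-word list are assumptions for illustration, not the list that `stopwords("english")` returns.

```r
# Simplified term counting: lowercase, strip non-letters, split on spaces,
# drop stop words, then tabulate. Assumes at least one term survives.
count_terms <- function(text, stop_words = c("the", "a", "of", "and", "in", "to")) {
  cleaned <- tolower(text)
  cleaned <- gsub("[^a-z ]", " ", cleaned) # digits and punctuation become spaces
  tokens <- unlist(strsplit(cleaned, " +"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  freq <- sort(table(tokens), decreasing = TRUE)
  data.frame(word = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

counts <- count_terms(c("The car and the road", "A car in 2020"))
print(counts)
```

A table like `counts` is what `wordcloud()` receives through its `words` and `freq` arguments.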

Data Analysis

Top Stories

The following sections are available in the Top Stories API: Automobiles, Books, Business, Health, Movies, Politics, Science, Sports, Technology, Travel, US and World.
The API does not allow fetching top stories for all sections, so only selected sections are used for the data analysis.
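
The per-section steps that follow repeat the same calls for each section. As a minimal sketch of how they could be driven from a vector of section names, the example below uses a stand-in function, `fetch_section`, so that it runs offline; that function is hypothetical and would be replaced by `get_data` in practice.

```r
sections <- c("automobiles", "books", "sports", "health")

# Stand-in for get_data(): returns a one-row data frame per section.
fetch_section <- function(section) {
  data.frame(Title = paste("Example story from", section),
             stringsAsFactors = FALSE)
}

# Fetch each section and tag its rows with the section name
per_section <- lapply(sections, function(s) {
  df <- fetch_section(s)
  df$Section <- s
  df
})

# Stack the per-section data frames into one
all_stories <- do.call(rbind, per_section)
print(all_stories)
```

With the real API, a short `Sys.sleep()` between calls may be needed if requests are rate-limited.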

Automobiles

df<-get_data("automobiles")

Plot a table

datatable(df)

Draw wordcloud

get_wordcloud(df)


Books

df<-get_data("books")

Plot a table

datatable(df)

Draw wordcloud

get_wordcloud(df)


Sports

df<-get_data("sports")

Plot a table

datatable(df)

Draw wordcloud

get_wordcloud(df)

Health

df<-get_data("health")

Plot a table

datatable(df)

Draw wordcloud

get_wordcloud(df)

Conclusion

  • The New York Times Top Stories API provides top stories in various sections.
  • Top stories from the Automobiles, Books, Sports and Health sections are analyzed in this assignment.
  • Integration through APIs gives a solid foundation, with a dataset that can be analyzed further.
  • Additional Web API features, including query parameters and responses in JSON format, are very helpful.
  • The developer version of the API limits fetching the full dataset from all sections, so only selected sections (Automobiles, Books, Sports, Health) are used in this analysis.