Task

The New York Times website provides a rich set of APIs, as described at http://developer.nytimes.com/docs.

You’ll need to start by signing up for an API key.

Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it to an R dataframe.

Get Data from API

I chose to get a key for the Article Search API. First, I need to specify a query. Since I studied philosophy in graduate school, I am going to search for “philosophy” as the keyword, from January 1, 2016 through October 1, 2016.

library(jsonlite)
library(tidyr)
library(dplyr)
apikey<-"?api-key=046721d6539e4338a818eb9e1d199ac9"

#query - search for philosophy articles
q<-"&q=philosophy"

#begin date
begin_date<-"&begin_date=20160101"

#end date
end_date<-"&end_date=20161001"

#sort
sort<-"&sort=oldest"

#page
page<-"&page=0"

#get documents
docs<-fromJSON(
    paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json"
           ,apikey
           ,q
           ,begin_date
           ,end_date
           ,sort
           ,page
            )
          )

#see number of hits
docs$response$meta$hits

I see that I have 1336 hits, so I am going to create a loop to pull all of them into a data frame. Since each page returns 10 results, I need to loop over pages 0 through 133 (134 pages × 10 results = 1340, enough to cover all 1336 hits). So as not to have to re-pull the data, I save the results to a .csv file for future use.
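Rather than hard-coding 133, the last page can be derived from the hit count. A minimal sketch, reusing the docs object from the call above:

#sketch: compute the zero-indexed last page from the hit count
hits<-docs$response$meta$hits
last_page<-ceiling(hits/10)-1  #10 results per page, pages start at 0; 1336 hits gives 133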

#initialize df
df<-c()

#loop
for(i in 0:133){

#page
page<-paste0("&page=",i)

#try
try(
  {

  #get documents
  docs<-fromJSON(
      paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json"
             ,apikey
             ,q
             ,begin_date
             ,end_date
             ,sort
             ,page
              )
            )
    
  #get data.frames
  temp<-docs$response$docs
  headline<-docs$response$docs$headline
  
  #select columns and remove row.names
  temp<-select(temp, web_url, snippet, lead_paragraph, abstract, print_page,
               source, pub_date, document_type, news_desk, section_name,
               subsection_name, type_of_material, word_count)
  headline<-select(headline, main, print_headline)
  
  temp<-data.frame(temp, row.names = NULL, stringsAsFactors = FALSE)
  headline<-data.frame(headline,row.names = NULL, stringsAsFactors = FALSE)
  
  #combine into one data.frame
  temp<-cbind(temp,headline)
  
  #bind to previous data
  df<-rbind(df,temp)
  
  #sleep 5 seconds to comply with the rate limit
  Sys.sleep(5)

}, silent = TRUE)


#print i and rows to monitor progress
print(c(i, nrow(df)))

}

#view column names
data.frame(names(df))

#write results to csv to save for use
write.csv(df,file = "C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/assignment9_df.csv")

Some requests would error out and stop my code (e.g., when trying to access a forbidden directory), so I had to wrap the call in a try() statement. To avoid hitting the server too often and drawing a rate-limit error, I added a 5-second delay between requests, which solved the problem. Note that the code above returned only 960 results because I had hit my rate limit for the day (1,000 calls; I had already used 40 while getting the code running and tested).
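A more targeted alternative to the blanket try() would be tryCatch() with explicit retries and a growing back-off. A rough sketch, where get_page() is a hypothetical helper rather than part of the code above:

#sketch: fetch one page with retries and a growing back-off
#get_page() is a hypothetical wrapper around the fromJSON() call above
get_page<-function(i, tries = 3){
  url<-paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json",
              apikey, q, begin_date, end_date, sort, "&page=", i)
  for(attempt in seq_len(tries)){
    result<-tryCatch(fromJSON(url), error = function(e) NULL)
    if(!is.null(result)) return(result)
    Sys.sleep(10*attempt)  #wait longer after each failure
  }
  NULL  #give up after 'tries' failed attempts
}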

The dataframe contains 960 observations with 15 columns. The column names are:

  1. web_url
  2. snippet
  3. lead_paragraph
  4. abstract
  5. print_page
  6. source
  7. pub_date
  8. document_type
  9. news_desk
  10. section_name
  11. subsection_name
  12. type_of_material
  13. word_count
  14. main
  15. print_headline
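If preferred, the same overview is available in one call with dplyr’s glimpse():

#sketch: column names, types, and sample values in one call
glimpse(df)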

Quick Analysis

The dates range from January 1, 2016 through July 17, 2016: since I sorted by oldest and stopped at 960 of the 1336 hits, the pull ends well before the October 1 end date. Most of the results are articles (873), followed by blog posts (79).

df<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/assignment9_df.csv", stringsAsFactors = FALSE)
df$X<-NULL

#see date range
date_range<-df %>% select(pub_date) %>% arrange(pub_date)
head(date_range)
##               pub_date
## 1 2016-01-01T00:00:00Z
## 2 2016-01-01T00:00:00Z
## 3 2016-01-01T00:00:00Z
## 4 2016-01-01T00:00:00Z
## 5 2016-01-01T00:00:00Z
## 6 2016-01-02T00:00:00Z
tail(date_range)
##                 pub_date
## 955 2016-07-12T00:00:00Z
## 956 2016-07-12T00:00:00Z
## 957 2016-07-12T20:02:45Z
## 958 2016-07-13T00:00:00Z
## 959 2016-07-14T00:00:00Z
## 960 2016-07-17T00:00:00Z
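Since pub_date is an ISO 8601 string, it sorts correctly as text, but converting it to Date makes the range a one-liner. A small sketch:

#sketch: parse the ISO 8601 timestamps and take the range directly
pub_dates<-as.Date(substr(df$pub_date, 1, 10))
range(pub_dates)  #earliest and latest publication dates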
#see document types
table(df$document_type)
## 
##    article   blogpost     column multimedia 
##        873         79          1          7

The word philosophy shows up in all sections. The most common news_desk after “None” is “OpEd”, followed by “Sports”. For section_name, “Sports”, “U.S.”, and “Opinion” make up the top three, and “News”, “Blog”, and “Review” are the top values of type_of_material. We also see many “Paid Death Notice” and “Obituary” entries.

#see news_desk
head(arrange(data.frame(table(df$news_desk)),desc(Freq)))
##         Var1 Freq
## 1       None  269
## 2       OpEd   86
## 3     Sports   58
## 4    Culture   57
## 5 Classified   51
## 6   National   49
#see section_name
head(arrange(data.frame(table(df$section_name)),desc(Freq)))
##      Var1 Freq
## 1  Sports  151
## 2    U.S.  137
## 3 Opinion  111
## 4   World   83
## 5    Arts   77
## 6   Books   62
#see type_of_material
head(arrange(data.frame(table(df$type_of_material)),desc(Freq)))
##                Var1 Freq
## 1              News  620
## 2              Blog   79
## 3            Review   57
## 4             Op-Ed   54
## 5 Paid Death Notice   51
## 6          Obituary   25

The longest hit, at 22,572 words, is the transcript of the Republican presidential debate in March. The shortest, at six words, is a video in which “Shaun talks about his training philosophy.”

#max word count
max(df$word_count, na.rm = TRUE)
## [1] 22572
subset(df,word_count == 22572)
##                                                                                                            web_url
## 379 http://www.nytimes.com/2016/03/11/us/politics/transcript-of-the-republican-presidential-debate-in-florida.html
##                                                                                             snippet
## 379 Following is a transcript of the Republican debate, as transcribed by the Federal News Service.
##                                                                                      lead_paragraph
## 379 Following is a transcript of the Republican debate, as transcribed by the Federal News Service.
##     abstract print_page             source             pub_date
## 379     <NA>         NA The New York Times 2016-03-11T00:00:00Z
##     document_type news_desk section_name subsection_name type_of_material
## 379       article  National         U.S.        Politics             News
##     word_count                                                        main
## 379      22572 Transcript of the Republican Presidential Debate in Florida
##     print_headline
## 379           <NA>
#min word count
min(df$word_count, na.rm = TRUE)
## [1] 6
subset(df,word_count == 6)
##                                                                                                      web_url
## 630 http://www.nytimes.com/video/movies/100000004363589/shaun-white-russia-calling-a-different-approach.html
##                                        snippet
## 630 Shaun talks about his training philosophy.
##                                 lead_paragraph abstract print_page
## 630 Shaun talks about his training philosophy.     <NA>         NA
##                     source             pub_date document_type news_desk
## 630 Internet Video Archive 2016-04-27T23:22:15Z    multimedia    Movies
##     section_name subsection_name type_of_material word_count
## 630       Movies            <NA>            Video          6
##                                                 main print_headline
## 630 Shaun White: Russia Calling A Different Approach           <NA>
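The same rows can also be located without hard-coding the counts, since which.max() and which.min() both skip NA values. A sketch:

#sketch: find the extreme rows without hard-coding the word counts
df[which.max(df$word_count), "web_url"]
df[which.min(df$word_count), "snippet"]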

Philosophy

What are the most common words in this collection of articles? That is, what is philosophy associated with? I use the words from the lead_paragraph field to find out.

library(tm)
library(stringr)

text_all_df<-c()

for(i in 1:nrow(df)){

#remove punctuation
text<-str_replace_all(df$lead_paragraph[i],"[:punct:]","")

#tolower
text<-tolower(text)

#remove stop words
text<-removeWords(text, stopwords('english'))

#split text
text_split<-unlist(str_split(text," "))

#remove blanks
text_clean<-text_split[text_split!=""]

if(length(text_clean)>0) {
    #combine with dataframe
    text_clean<-cbind(i,text_clean)
    text_all_df<-rbind(text_all_df, text_clean)
    }
}

text_all_df<-data.frame(text_all_df, stringsAsFactors = FALSE)
names(text_all_df)<-c("article","word")

#show first 10 words
head(text_all_df, n=10)
##    article       word
## 1        1    witness
## 2        1  operating
## 3        1       room
## 4        1   patients
## 5        1  conscious
## 6        2    turning
## 7        2      2015s
## 8        2 successful
## 9        2  investors
## 10       2   analysts
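The same long-format table could also be built without an explicit loop, for example with the tidytext package. This is an alternative, not what was run above, and tidytext’s stop_words lexicon is larger than tm’s, so the counts would differ slightly:

#sketch: tokenize lead paragraphs with tidytext instead of the loop above
library(tidytext)
text_alt<-df %>%
  mutate(article = row_number()) %>%      #keep an article id, like i above
  filter(!is.na(lead_paragraph)) %>%
  select(article, lead_paragraph) %>%
  unnest_tokens(word, lead_paragraph) %>% #lowercases and strips punctuation
  anti_join(stop_words, by = "word")      #drop common English stop words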

The top 10 words include “new”, “university”, “philosophy”, and “years”. Below is a word cloud of all words that appear 25 or more times.

#see top words
top_words<-data.frame(table(text_all_df$word), stringsAsFactors = FALSE)
head(arrange(top_words,desc(Freq)), n = 10)
##          Var1 Freq
## 1         new  185
## 2  university  142
## 3  philosophy  107
## 4       years   98
## 5        york   91
## 6        will   84
## 7         one   75
## 8       world   70
## 9          st   63
## 10       city   58
#create wordcloud
library(wordcloud)
wordcloud(top_words$Var1, top_words$Freq, min.freq=25,colors=brewer.pal(6, "Dark2"),scale=c(4, .1))

We see that “new” and “york” are common, as we might expect. The fact that philosophy is an academic discipline helps explain “university”, “professor”, “students”, and “school”. We can also see evidence of the obituaries in words like “born”, “survived”, “died”, “years”, and “life”. More political topics show up in “president”, “united”, and “american”. Finally, we can take philosophy as a description of one’s way of life, probably again related to the obituaries: “love”, “friends”, “service”, “work”, and “beloved”.

In short, “philosophy” is associated with many kinds of articles covering a wide variety of topics in the New York Times Article Search API.