The New York Times web site provides a rich set of APIs, as described here: http://developer.nytimes.com/docs
You’ll need to start by signing up for an API key.
Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it to an R dataframe.
I chose to get a key for the Article Search API. First, I need to specify a query. Since I studied philosophy in graduate school, I am going to search for “philosophy” as the keyword from January 1, 2016, through October 1, 2016.
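For the code below I hard-code the key into the script. As an aside, the key could instead be read from an environment variable so it never appears in the source; a minimal sketch, assuming a hypothetical variable named NYT_API_KEY has been set beforehand (the actual code below simply hard-codes the key):
#read the key from an environment variable (hypothetical name) rather than hard-coding it
apikey<-paste0("?api-key=",Sys.getenv("NYT_API_KEY"))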
library(jsonlite)
library(tidyr)
library(dplyr)
#api key
apikey<-"?api-key=046721d6539e4338a818eb9e1d199ac9"
#query - search for philosophy articles
q<-"&q=philosophy"
#begin date
begin_date<-"&begin_date=20160101"
#end date
end_date<-"&end_date=20161001"
#sort
sort<-"&sort=oldest"
#page
page<-"&page=0"
#get documents
docs<-fromJSON(
paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json"
,apikey
,q
,begin_date
,end_date
,sort
,page
)
)
#see number of hits
docs$response$meta$hits
I see that I have 1336 hits, so I am going to create a loop to pull all of those hits into a data frame. Since the API returns 10 results per page, covering all 1336 hits means looping over pages 0 through 133 (134 pages of 10 results each). So that I don't have to re-pull the data, I save the results to a .csv file for future use.
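Rather than hard-coding 133, the last page index can also be derived from the hit count; a minimal sketch, given that the API returns 10 results per page:
#derive the last page index from the hit count (with 1336 hits this is 133)
hits<-docs$response$meta$hits
last_page<-ceiling(hits/10)-1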
#initialize df
df<-c()
#loop
for(i in 0:133){
#page
page<-paste0("&page=",i)
#try
try(
{
#get documents
docs<-fromJSON(
paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json"
,apikey
,q
,begin_date
,end_date
,sort
,page
)
)
#get data.frames
temp<-docs$response$docs
headline<-docs$response$docs$headline
#select columns and remove row.names
temp<-select(temp,web_url,snippet,lead_paragraph,abstract,print_page,source,pub_date,document_type,news_desk,section_name,subsection_name,type_of_material,word_count)
headline<-select(headline, main, print_headline)
temp<-data.frame(temp, row.names = NULL, stringsAsFactors = FALSE)
headline<-data.frame(headline,row.names = NULL, stringsAsFactors = FALSE)
#combine into one data.frame
temp<-cbind(temp,headline)
#bind to previous data
df<-rbind(df,temp)
#sleep-5 seconds to comply with rate limit
Sys.sleep(5)
}, silent = TRUE)
#print i and rows to monitor progress
print(c(i, nrow(df)))
}
#view column names
data.frame(names(df))
#write results to csv to save for use
write.csv(df,file = "C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/assignment9_df.csv")
Because errors would occasionally occur and stop my code (e.g., the server returning a forbidden-access error), I had to wrap the request in a try() statement. To avoid hitting the server too many times and getting another error due to rate limiting, I added a 5-second delay between requests, which solved that problem. Also, the code above returned only 960 results because I had hit my rate limit for the day (1,000 calls); I had already used 40 of them while getting the code running and tested.
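For reference, the same protection could be written with tryCatch() so that a failed request is reported and retried after a longer pause; a minimal sketch of a hypothetical helper (not the code used for the results below):
#fetch one page of results, waiting 30 seconds and retrying once if the request errors
get_page<-function(url){
  tryCatch(
    fromJSON(url),
    error = function(e){
      message("request failed, retrying in 30 seconds: ", conditionMessage(e))
      Sys.sleep(30)
      fromJSON(url)
    }
  )
}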
The data frame contains 960 observations of 15 columns: web_url, snippet, lead_paragraph, abstract, print_page, source, pub_date, document_type, news_desk, section_name, subsection_name, type_of_material, word_count, main, and print_headline.
The dates range from January 1, 2016 through July 17, 2016. Most of the results are articles (873), followed by blog posts (79).
df<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/assignment9_df.csv", stringsAsFactors = FALSE)
df$X<-NULL
#see date range
date_range<-df %>% select(pub_date) %>% arrange(pub_date)
head(date_range)
## pub_date
## 1 2016-01-01T00:00:00Z
## 2 2016-01-01T00:00:00Z
## 3 2016-01-01T00:00:00Z
## 4 2016-01-01T00:00:00Z
## 5 2016-01-01T00:00:00Z
## 6 2016-01-02T00:00:00Z
tail(date_range)
## pub_date
## 955 2016-07-12T00:00:00Z
## 956 2016-07-12T00:00:00Z
## 957 2016-07-12T20:02:45Z
## 958 2016-07-13T00:00:00Z
## 959 2016-07-14T00:00:00Z
## 960 2016-07-17T00:00:00Z
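Because pub_date is stored as an ISO 8601 string, sorting the raw text happens to give chronological order; converting it to a Date makes the range easier to work with. A minimal sketch using base R:
#parse the date portion of the ISO 8601 timestamp and show the earliest and latest dates
pub_dates<-as.Date(substr(df$pub_date,1,10))
range(pub_dates, na.rm = TRUE)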
#see document types
table(df$document_type)
##
## article blogpost column multimedia
## 873 79 1 7
The word “philosophy” can be found in all sections. The most common news_desk after “None” is “OpEd”, followed by “Sports”. For section_name, “Sports”, “U.S.”, and “Opinion” make up the top three. “News”, “Blog”, and “Review” are the top values of type_of_material. We also see many “Paid Death Notice” and “Obituary” entries.
#see news_desk
head(arrange(data.frame(table(df$news_desk)),desc(Freq)))
## Var1 Freq
## 1 None 269
## 2 OpEd 86
## 3 Sports 58
## 4 Culture 57
## 5 Classified 51
## 6 National 49
#see section_name
head(arrange(data.frame(table(df$section_name)),desc(Freq)))
## Var1 Freq
## 1 Sports 151
## 2 U.S. 137
## 3 Opinion 111
## 4 World 83
## 5 Arts 77
## 6 Books 62
#see type_of_material
head(arrange(data.frame(table(df$type_of_material)),desc(Freq)))
## Var1 Freq
## 1 News 620
## 2 Blog 79
## 3 Review 57
## 4 Op-Ed 54
## 5 Paid Death Notice 51
## 6 Obituary 25
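As an aside, the same frequency tables can be produced directly with dplyr; a minimal sketch for news_desk (section_name and type_of_material work the same way):
#count and sort news_desk values with dplyr
head(count(df, news_desk, sort = TRUE))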
The longest hit has 22,572 words and comes from the transcript of the Republican debate in March. The shortest, at 6 words, comes from a video in which “Shaun talks about his training philosophy.”
#max word count
max(df$word_count, na.rm = TRUE)
## [1] 22572
subset(df,word_count == 22572)
## web_url
## 379 http://www.nytimes.com/2016/03/11/us/politics/transcript-of-the-republican-presidential-debate-in-florida.html
## snippet
## 379 Following is a transcript of the Republican debate, as transcribed by the Federal News Service.
## lead_paragraph
## 379 Following is a transcript of the Republican debate, as transcribed by the Federal News Service.
## abstract print_page source pub_date
## 379 <NA> NA The New York Times 2016-03-11T00:00:00Z
## document_type news_desk section_name subsection_name type_of_material
## 379 article National U.S. Politics News
## word_count main
## 379 22572 Transcript of the Republican Presidential Debate in Florida
## print_headline
## 379 <NA>
#min word count
min(df$word_count, na.rm = TRUE)
## [1] 6
subset(df,word_count == 6)
## web_url
## 630 http://www.nytimes.com/video/movies/100000004363589/shaun-white-russia-calling-a-different-approach.html
## snippet
## 630 Shaun talks about his training philosophy.
## lead_paragraph abstract print_page
## 630 Shaun talks about his training philosophy. <NA> NA
## source pub_date document_type news_desk
## 630 Internet Video Archive 2016-04-27T23:22:15Z multimedia Movies
## section_name subsection_name type_of_material word_count
## 630 Movies <NA> Video 6
## main print_headline
## 630 Shaun White: Russia Calling A Different Approach <NA>
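As an aside, the row with the maximum word count can also be located without hard-coding the value; a minimal sketch:
#find the row with the highest word count directly
df[which.max(df$word_count), c("web_url","word_count")]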
What are the most common words in this collection of articles? That is, what is philosophy associated with? I use the words from the lead_paragraph to find out.
library(tm)
library(stringr)
text_all_df<-c()
for(i in 1:nrow(df)){
#remove punctuation
text<-str_replace_all(df$lead_paragraph[i],"[:punct:]","")
#tolower
text<-tolower(text)
#remove stop words
text<-removeWords(text, stopwords('english'))
#split text
text_split<-unlist(str_split(text," "))
#remove blanks
text_clean<-
text_split[!text_split==""]
if(length(text_clean)>0) {
#combine with dataframe
text_clean<-cbind(i,text_clean)
text_all_df<-rbind(text_all_df, text_clean)
}
}
text_all_df<-data.frame(text_all_df, stringsAsFactors = FALSE)
names(text_all_df)<-c("article","word")
#show first 10 words
head(text_all_df, n=10)
## article word
## 1 1 witness
## 2 1 operating
## 3 1 room
## 4 1 patients
## 5 1 conscious
## 6 2 turning
## 7 2 2015s
## 8 2 successful
## 9 2 investors
## 10 2 analysts
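As an aside, the tokenizing loop above could also be written with the tidytext package, which lowercases, strips punctuation, and removes stop words in a few verbs; a minimal sketch, assuming tidytext is installed (not the code used for the results below, and its stop-word list differs slightly from tm's):
#tokenize lead paragraphs into one word per row and drop English stop words
library(tidytext)
tidy_words<-df %>%
  mutate(article = row_number()) %>%
  select(article, lead_paragraph) %>%
  filter(!is.na(lead_paragraph)) %>%
  unnest_tokens(word, lead_paragraph) %>%
  anti_join(stop_words, by = "word")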
The top 10 words include “new”, “university”, “philosophy”, and “years”. A word cloud is also below with all words that have counts of 25 or more.
#see top words
top_words<-data.frame(table(text_all_df$word), stringsAsFactors = FALSE)
head(arrange(top_words,desc(Freq)), n = 10)
## Var1 Freq
## 1 new 185
## 2 university 142
## 3 philosophy 107
## 4 years 98
## 5 york 91
## 6 will 84
## 7 one 75
## 8 world 70
## 9 st 63
## 10 city 58
#create wordcloud
library(wordcloud)
wordcloud(top_words$Var1, top_words$Freq, min.freq=25,colors=brewer.pal(6, "Dark2"),scale=c(4, .1))
We see that “new” and “york” are common, as we might expect. The fact that philosophy is an academic discipline helps explain “university”, “professor”, “students”, and “school”. We can also see evidence of the obituaries in words like “born”, “survived”, “died”, “years”, and “life”. There is also evidence of more political topics, shown by “president”, “united”, and “american”. Finally, we can take philosophy to be a description of one’s way of life, probably related to the obituaries: “love”, “friends”, “service”, “work”, and “beloved”.
In short, “philosophy” is associated with many kinds of articles covering a wide variety of topics in the New York Times Article Search API.