library(jsonlite)
library(data.table)
library(dplyr)
library(knitr)
library(ggplot2)
The New York Times web site provides a rich set of APIs, as described here: http://developer.nytimes.com/docs. You’ll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it to an R data frame.
Defined the parameters of the query (search keywords, start and end dates, and the API key).
# March Madness query:
term <- "march+madness+basketball"
begin_date <- "20180301"
end_date <- "20180331"
api.key<- "&api-key=ace27904c56848de8cf3c9a855163697"
madness_url <- paste0("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=",term,"&begin_date=",begin_date,"&end_date=",
end_date)
madness_url
## [1] "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=march+madness+basketball&begin_date=20180301&end_date=20180331"
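The URL assembly above can also be wrapped in a small function, which makes it easier to reuse for other search terms and pages. The following is a minimal sketch, not the method used in this report: `build_search_url` is a hypothetical helper, and `NYT_API_KEY` is an assumed environment variable name for keeping the key out of the script. It takes the search term with plain spaces, because `utils::URLencode(..., reserved = TRUE)` would percent-encode a literal `+`.

```r
# Hypothetical helper (not in the original report): builds an Article Search URL.
# NYT_API_KEY is an assumed environment variable name.
build_search_url <- function(term, begin_date, end_date, page = 0,
                             api_key = Sys.getenv("NYT_API_KEY")) {
  base <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
  params <- c(
    q = term,
    begin_date = begin_date,
    end_date = end_date,
    page = page,
    `api-key` = api_key
  )
  # Percent-encode each value so spaces and punctuation survive in the query string
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

# Usage with a placeholder key
url <- build_search_url("march madness basketball", "20180301", "20180331",
                        api_key = "KEY123")
```

Passing `page` as an argument means the same function can serve both the first request and the later paging loop.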
Used “fromJSON” to send the “madness_url” request to the API and parse the JSON response.
# Fetch the data; the API returns only 10 rows per request
march_madness_query <- data.frame(fromJSON(paste0(madness_url, api.key))) %>%
  select(-c(status, copyright, response.docs.blog,
            response.docs.multimedia, response.docs.headline,
            response.docs.keywords, response.docs.byline))
march_madness_query
   response.docs.web_url
1  https://www.nytimes.com/2018/03/13/learning/march-madness.html
2  https://www.nytimes.com/2018/03/27/opinion/poetry-march-madness-basketball.html
3  https://www.nytimes.com/aponline/2018/03/21/sports/ncaabasketball/ap-bkc-ncaa-tournament-regional-guide.html
4  https://www.nytimes.com/aponline/2018/03/19/sports/ncaabasketball/ap-bkc-ncaa-amped-up-madness.html
5  https://www.nytimes.com/2018/03/15/briefing/pennsylvania-elizabeth-holmes-march-madness.html
6  https://www.nytimes.com/2018/03/15/arts/television/whats-on-tv-thursday-march-madness-and-how-to-get-away-with-murder.html
7  https://www.nytimes.com/2018/03/17/sports/32-teams-7-days-1-gym.html
8  https://www.nytimes.com/2018/03/22/learning/student-walkouts-march-madness-and-risk-enhanced-playgrounds-our-favorite-student-comments-of-the-week.html
9  https://www.nytimes.com/2018/03/12/sports/march-madness-predictions-experts.html
10 https://www.nytimes.com/interactive/2018/03/20/learning/20StudentNewsQuiz-Hawking-Basketball-Protests.html
   response.docs.snippet
1  Who do you predict to win the tournament this year?
2  At Loyola, our poetry workshop was energized by our team’s win, not overshadowed by it.
3  The opening weekend of the NCAA Tournament was not madness. It was straight bonkers.
4  One word succinctly describes what’s transpired so far in the NCAA Tournament:
5  Here’s what you need to know to start your day.
6  March Madness begins with games all afternoon and evening. And watch season finales with RuPaul and Annalise Keating.
7  For true madness, see the N.A.I.A. basketball tournament in Kansas City.
8  The best teenage comments from last week’s writing prompts, and an invitation to join the conversation this week.
9  There is no shortage of experts with advice to guide you to all the winners in the N.C.A.A. tournament. Here’s a roundup of selections from people in a position to know.
10 How well did you follow the news this past week? How many of these 10 questions can you get right?
   response.docs.source     response.docs.pub_date
1    The New York Times 2018-03-13T07:00:01+0000
2    The New York Times 2018-03-28T00:05:01+0000
3                    AP 2018-03-21T18:48:29+0000
4                    AP 2018-03-19T07:09:55+0000
5    The New York Times 2018-03-15T09:33:51+0000
6    The New York Times 2018-03-15T05:00:05+0000
7    The New York Times 2018-03-17T20:49:48+0000
8    The New York Times 2018-03-22T19:53:13+0000
9    The New York Times 2018-03-12T13:21:38+0000
10   The New York Times 2018-03-20T11:41:10+0000
   response.docs.document_type response.docs.new_desk
1                      article               Learning
2                      article                   OpEd
3                      article                   None
4                      article                   None
5                      article                 NYTNow
6                      article                Culture
7                      article                 Sports
8                      article               Learning
9                      article                 Sports
10                  multimedia   The Learning Network
   response.docs.type_of_material        response.docs._id
1                            News 5aa776f747de81a90120def1
2                           Op-Ed 5abadc3047de81a90121843f
3                            News 5ab2a90047de81a901214a7a
4                            News 5aaf624547de81a9012125f3
5                        briefing 5aaa3e0247de81a901210423
6                        Schedule 5aa9fdd947de81a90121027f
7                            News 5aad7f7047de81a901211f17
8                            News 5ab409b047de81a90121580d
9                            News 5aa67ee547de81a90120d71b
10            Interactive Feature 5ab0f35f47de81a901213223
   response.docs.word_count response.docs.score
1                       108          0.04048129
2                       926          0.02787079
3                       737          0.02224132
4                      1043          0.02179713
5                      1191          0.01992367
6                       480          0.01820390
7                      1637          0.01780204
8                      6452          0.01732480
9                      1393          0.01672325
10                        0          0.01666914
   response.docs.uri
1  nyt://article/f990a3f7-b3e0-54c0-8973-ce8afb7f7bb8
2  nyt://article/9ea6717c-a398-5437-b60c-615212754e62
3  nyt://article/ccf9aebc-5436-5d2d-b15c-9b8a1fe27763
4  nyt://article/9454a0bb-9bcf-5206-9582-7f95b946f38d
5  nyt://article/68ae1f67-82e3-5142-85ee-b6b114c41fbf
6  nyt://article/9760a8f1-ad18-5fc6-9286-09a6f6d74c87
7  nyt://article/a352017e-2f03-5fad-961b-a54760dedf33
8  nyt://article/1b248820-b37c-5928-b9f3-db14d0312c54
9  nyt://article/3bb032a8-298a-5574-8ecc-9d9fe7cc2a95
10 nyt://interactive/c4d1deca-9d9e-5463-ae19-7d9743f413c7
   response.docs.print_page response.docs.section_name response.meta.hits
1
Find the number of pages available. The “response.meta.hits” column in the output gives the total number of matching articles, and the API returns 10 results per page, with pages numbered from 0. Dividing the hit count by 10 and subtracting 1 gives the upper bound for the loop.
max_pages<-round((march_madness_query$response.meta.hits[1] / 10)-1)
max_pages
## [1] 16
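One caveat with the `round()` call above: when the hit count is not a multiple of 10, rounding can drop the final, partially filled page. A minimal sketch of a ceiling-based alternative follows. `last_page_index` is a hypothetical helper, and the hit counts in the calls are illustrative values, not the count returned by this query.

```r
# Hypothetical helper: index of the last zero-based page for a given hit count.
last_page_index <- function(hits, per_page = 10) {
  # ceiling() keeps a partially filled final page; max() guards against zero hits
  max(ceiling(hits / per_page) - 1, 0)
}

last_page_index(170)  # 16: pages 0..16 hold exactly 170 results
last_page_index(172)  # 17: the two extra results need one more page
```

With 172 hits, `round((172 / 10) - 1)` would return 16 and the loop would stop two articles short.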
Looped through the response pages and combined the results.
responses <- list()
for (i in 0:max_pages) {
  madness.search <- data.frame(fromJSON(paste0(madness_url, api.key, "&page=", i), flatten = TRUE)) %>%
    select(response.docs.web_url, response.docs.source, response.docs.word_count,
           response.docs.new_desk, response.docs.type_of_material, response.docs.pub_date) %>%
    rename(url = response.docs.web_url, source = response.docs.source,
           word_count = response.docs.word_count, news_desk = response.docs.new_desk,
           material_type = response.docs.type_of_material, publish_date = response.docs.pub_date)
  responses[[i + 1]] <- madness.search
  Sys.sleep(1)  # pause between requests
}
combined_responses <- rbind_pages(responses)
kable(head(combined_responses,20), caption = "March Madness Articles")
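A single failed request stops the loop above and discards the pages already fetched. The sketch below shows one way to make the paging step more tolerant, under stated assumptions: `collect_pages` and `fake_fetch` are hypothetical names, the fetcher is passed in as a function so the example runs without network access, and base `rbind` stands in for `rbind_pages`, so every page must have the same columns.

```r
# Hypothetical helper: fetch pages 0..last_page with a caller-supplied function
# and skip any page whose request fails instead of aborting the whole loop.
collect_pages <- function(fetch_page, last_page, pause = 1) {
  pages <- list()
  for (i in 0:last_page) {
    result <- tryCatch(fetch_page(i), error = function(e) {
      message("page ", i, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) pages[[length(pages) + 1]] <- result
    Sys.sleep(pause)  # pause between requests
  }
  do.call(rbind, pages)
}

# Usage with a stand-in fetcher; a real one would wrap the fromJSON() call
fake_fetch <- function(i) {
  if (i == 1) stop("simulated network error")
  data.frame(page = i, url = paste0("https://example.com/", i))
}
collected <- collect_pages(fake_fetch, last_page = 2, pause = 0)
```

Here `collected` holds the rows from pages 0 and 2, and the failure on page 1 is reported as a message.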
Found which news desks published the most March Madness articles. Unsurprisingly, the sports desks published the most articles.
combined_responses %>%
  group_by(news_desk) %>%
  summarize(count = n()) %>%
  filter(news_desk != "None") %>%
  mutate(percent = (count / sum(count)) * 100) %>%
  ggplot() +
  # fill is a constant color, so it is set outside aes()
  geom_bar(aes(y = percent, x = reorder(news_desk, -percent)), fill = "tomato3", stat = "identity") +
  coord_flip() +
  labs(x = 'News Desk',
       y = 'Percent Total',
       title = "Publishing Desk",
       caption = "Source: New York Times API") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
Explored the distribution of the publishing date. It is very interesting to see the bulk of the articles came out toward the of the month.
combined_responses %>%
mutate(day=gsub("T.*","",publish_date)) %>%
group_by(day) %>%
summarise(count=n()) %>%
ggplot() +
geom_bar(aes(x=reorder(day, -count), y=count), fill= "yellow", stat="identity") + coord_flip()+
labs(x='Publishing Date',
y='Count',
title="Publishing Date",
caption="Source: New York Times API") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
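Because the plot above orders the bars by count, the position of each day within the month is harder to read. A minimal sketch of an alternative is to parse the timestamps into `Date` values and count per day in calendar order. It uses base R only, and the four timestamps are taken from the first page of results shown earlier.

```r
# Timestamps in the same format as publish_date
publish_date <- c("2018-03-13T07:00:01+0000", "2018-03-28T00:05:01+0000",
                  "2018-03-15T09:33:51+0000", "2018-03-15T05:00:05+0000")

# Keep the date part (first 10 characters) and convert to Date
day <- as.Date(substr(publish_date, 1, 10))

# table() sorts its levels; YYYY-MM-DD strings sort chronologically
daily_counts <- as.data.frame(table(day = format(day)), stringsAsFactors = FALSE)
daily_counts$day <- as.Date(daily_counts$day)
daily_counts
```

Plotting `daily_counts` with `day` on a date axis would then show the articles in calendar order.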
Explored the distribution of the sources. I found it very interesting that the Associated Press sourced more of the content than the NY Times.
combined_responses %>%
group_by(source) %>%
summarise(count=n()) %>%
ggplot() +
geom_bar(aes(x=reorder(source, -count), y=count), fill= "blue", stat="identity") + coord_flip()+
labs(x='Source',
y='Count',
title="Article Source",
caption="Source: New York Times API") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))