The New York Times website provides a rich set of APIs. For this assignment I selected the Article Search API, which looks up articles by keyword; the topic searched for is machine learning. Results can be refined with filters and facets. The goal is to call the New York Times API with certain parameters and convert the fetched data into an R data frame.
The following packages were used for both data analysis and visualization.
library("dplyr")
library("XML")
library("jsonlite")
library("ggplot2")
library("DT")
library("kableExtra")The data collected from New York Times articles are related to Machine Learning published from 01/01/2019 and 03/31/2019.
The data are fetched using an API key registered with the NYT Developer site. The base URL combines several query parameters.
# Let's set some parameters
term <- "machine+learning" # Need to use + to string together separate words
begin_date <- "20190101"
end_date <- "20190331"
api_key <- "YOUR_API_KEY" # Use your own key from developer.nytimes.com; never publish a real key
# paste0() joins with no separator, so no sep argument is needed
base_url <- paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json?q=", term,
                   "&begin_date=", begin_date, "&end_date=", end_date,
                   "&facet_filter=true&api-key=", api_key)
base_query <- fromJSON(base_url)
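Joining words with "+" works for simple terms, but a phrase containing reserved characters would break the hand-built query string. As an optional alternative (a hypothetical variant, not used above), base R's URLencode() can encode the raw phrase:

# Hypothetical variant: URL-encode the raw phrase instead of hand-joining with "+"
raw_term <- "machine learning"
encoded_term <- URLencode(raw_term, reserved = TRUE) # yields "machine%20learning"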
# Maximum pages with selected parameters
# Pages hold 10 results each and are zero-indexed, so the last page index is ceiling(hits/10) - 1
maxPages <- ceiling(base_query$response$meta$hits[1] / 10) - 1
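The Article Search API serves 10 results per page and, per NYT's documentation at the time of writing, paginates only up to about page 100 (roughly 1,000 results). Assuming that cap applies, a defensive guard keeps the loop below within bounds:

# Guard against the API's pagination cap (assumed maximum page index of 100)
maxPages <- min(maxPages, 100)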
# Fetch each page of results into a list of data frames
pages <- list()
for (i in 0:maxPages) {
  nyt_search <- fromJSON(paste0(base_url, "&page=", i), flatten = TRUE) %>% data.frame()
  message("Retrieving page ", i)
  pages[[i + 1]] <- nyt_search
  Sys.sleep(6) # Pause between requests to stay under the API rate limit
}

## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
## Retrieving page 3
## Retrieving page 4
## Retrieving page 5
## Retrieving page 6
## Retrieving page 7
## Retrieving page 8
# Combine the per-page data frames into one (jsonlite::rbind_pages handles differing columns)
all_NYT_Articles <- rbind_pages(pages)
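A transient network error or a rate-limit rejection would abort the loop above and discard the pages already fetched. A sketch of a more defensive fetch step, using a hypothetical fetch_page() helper, wraps each request in tryCatch():

# Hypothetical helper: return NULL instead of failing when a page cannot be fetched
fetch_page <- function(i) {
  tryCatch(
    fromJSON(paste0(base_url, "&page=", i), flatten = TRUE) %>% data.frame(),
    error = function(e) {
      message("Page ", i, " failed: ", conditionMessage(e))
      NULL
    }
  )
}
# The loop could then use: pages <- Filter(Negate(is.null), lapply(0:maxPages, fetch_page))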
# column names
names(all_NYT_Articles)

## [1] "status"
## [2] "copyright"
## [3] "response.docs.web_url"
## [4] "response.docs.snippet"
## [5] "response.docs.lead_paragraph"
## [6] "response.docs.abstract"
## [7] "response.docs.source"
## [8] "response.docs.multimedia"
## [9] "response.docs.keywords"
## [10] "response.docs.pub_date"
## [11] "response.docs.document_type"
## [12] "response.docs.news_desk"
## [13] "response.docs.section_name"
## [14] "response.docs.type_of_material"
## [15] "response.docs._id"
## [16] "response.docs.word_count"
## [17] "response.docs.uri"
## [18] "response.docs.print_page"
## [19] "response.docs.subsection_name"
## [20] "response.docs.headline.main"
## [21] "response.docs.headline.kicker"
## [22] "response.docs.headline.content_kicker"
## [23] "response.docs.headline.print_headline"
## [24] "response.docs.headline.name"
## [25] "response.docs.headline.seo"
## [26] "response.docs.headline.sub"
## [27] "response.docs.byline.original"
## [28] "response.docs.byline.person"
## [29] "response.docs.byline.organization"
## [30] "response.meta.hits"
## [31] "response.meta.offset"
## [32] "response.meta.time"
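Because flatten = TRUE prefixes every article field with response.docs., the column names are long. An optional cleanup (not applied here, since the code below keeps the original names) would strip that prefix:

# Optional: drop the "response.docs." prefix from column names (not applied below)
tidy_names <- gsub("^response\\.docs\\.", "", names(all_NYT_Articles))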
# Subsetting and filtering the data
ML_articles <- all_NYT_Articles %>%
  filter(response.docs.document_type == "article") %>%
  select(response.docs.uri, response.docs.source, response.docs.subsection_name,
         response.docs.type_of_material, response.docs.pub_date, response.docs.word_count)
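One optional refinement, not required for the summary below: response.docs.pub_date arrives as an ISO-8601 character string, so its date portion can be parsed if date-based grouping is ever needed (a sketch assuming the "YYYY-MM-DDT..." format):

# Sketch: parse the first 10 characters of the ISO-8601 timestamp into a Date
ML_dates <- ML_articles %>%
  mutate(pub_day = as.Date(substr(response.docs.pub_date, 1, 10)))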
# Display the filtered articles as an interactive table
datatable(ML_articles)

df_ml <- ML_articles %>%
  filter(response.docs.source == "The New York Times") %>%
  group_by(response.docs.type_of_material) %>%
  summarise(Total.Count = sum(response.docs.word_count))
# Total word count per material type, ordered by count
ggplot(data = df_ml, aes(x = reorder(response.docs.type_of_material, Total.Count), y = Total.Count)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  xlab("Type of Material") +
  ylab("Total Word Count") +
  theme_minimal()

Measured by total word count, articles of the "News" material type account for far more of The New York Times output in this period than any other material type.
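Because the chart sums word counts, a simple count of articles per material type is a slightly different measure; a minimal sketch of that tally (a hypothetical follow-up, not part of the original analysis):

# Sketch: count articles per material type instead of summing word counts
df_counts <- ML_articles %>%
  filter(response.docs.source == "The New York Times") %>%
  count(response.docs.type_of_material, name = "Articles")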