Overview

The New York Times website provides a rich set of APIs. For this assignment I selected machine learning as the topic and used the Article Search API, which looks up articles by keyword and lets results be refined with filters and facets. The goal is to query the API with specific parameters and convert the fetched data into an R data frame.


R Packages Used

The following packages were used for data retrieval, analysis, and visualization.

library("dplyr")
library("XML")
library("jsonlite")
library("ggplot2")
library("DT")
library("kableExtra")

New York Times API

The data consist of New York Times articles related to machine learning published between 01/01/2019 and 03/31/2019.

Fetching the Data

The data is fetched using an API key registered on the NYT developers site. The base URL carries the query parameters.
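As a minimal sketch of how such a query URL can be assembled from its parameters, the helper below builds the query string from a named vector and URL-encodes each value. The function name `build_nyt_url` and the placeholder key `YOUR_API_KEY` are hypothetical and not part of this report's script; encoding the space as `%20` is an assumed alternative to joining words with `+`.

```r
# Hypothetical helper (not part of the original script): build an Article
# Search URL from named parameters, URL-encoding each value.
build_nyt_url <- function(term, begin_date, end_date, api_key) {
  params <- c(
    q = term,
    begin_date = begin_date,
    end_date = end_date,
    facet_filter = "true",
    "api-key" = api_key
  )
  # Encode one value at a time so reserved characters and spaces are escaped
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json?", query)
}

# "YOUR_API_KEY" is a placeholder, not a working key
url <- build_nyt_url("machine learning", "20190101", "20190331", "YOUR_API_KEY")
print(url)
```

Keeping the parameters in one named vector makes it easy to add or remove a filter without editing a long `paste0()` call.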

# Let's set some parameters
term <- "machine+learning" # Need to use + to string together separate words
begin_date <- "20190101"
end_date <- "20190331"
api_key<-"9QjNM7qvbsLVAAErpgHwklHMWUhpbEIG"

base_url <- paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json?q=",term,
                  "&begin_date=",begin_date,"&end_date=",end_date,
                  "&facet_filter=true&api-key=",api_key)


base_query <- fromJSON(base_url)

# Index of the last page: the API returns 10 results per page and numbers pages from 0

maxPages<-ceiling(base_query$response$meta$hits[1]/10)-1

# Fetch each page and store it as an R data frame

pages <- list()
for(i in 0:maxPages){
  nyt_search <- fromJSON(paste0(base_url, "&page=", i), flatten = TRUE) %>% data.frame()
  message("Retrieving page ", i)
  pages[[i+1]] <- nyt_search
  Sys.sleep(6) # Pause between requests to stay within the API rate limit
}
## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
## Retrieving page 3
## Retrieving page 4
## Retrieving page 5
## Retrieving page 6
## Retrieving page 7
## Retrieving page 8
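Because the API returns 10 results per page and numbers pages from 0, the index of the last page follows directly from the hit count. The sketch below works through that arithmetic with a hypothetical helper, `last_page_index`, which is an illustration and not part of the original script.

```r
# Hypothetical helper: zero-based index of the last page for a given hit count,
# assuming a fixed number of results per page (10 for the Article Search API).
last_page_index <- function(hits, per_page = 10) {
  max(0, ceiling(hits / per_page) - 1)
}

last_page_index(85)  # 8: pages 0 to 8 hold up to 90 results
last_page_index(90)  # 8: exactly nine full pages
last_page_index(91)  # 9: one extra result needs a tenth page
```

Using `ceiling()` means that a partly filled final page is still requested.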

Final output

all_NYT_Articles <- rbind_pages(pages)

# column names

names(all_NYT_Articles)
##  [1] "status"                               
##  [2] "copyright"                            
##  [3] "response.docs.web_url"                
##  [4] "response.docs.snippet"                
##  [5] "response.docs.lead_paragraph"         
##  [6] "response.docs.abstract"               
##  [7] "response.docs.source"                 
##  [8] "response.docs.multimedia"             
##  [9] "response.docs.keywords"               
## [10] "response.docs.pub_date"               
## [11] "response.docs.document_type"          
## [12] "response.docs.news_desk"              
## [13] "response.docs.section_name"           
## [14] "response.docs.type_of_material"       
## [15] "response.docs._id"                    
## [16] "response.docs.word_count"             
## [17] "response.docs.uri"                    
## [18] "response.docs.print_page"             
## [19] "response.docs.subsection_name"        
## [20] "response.docs.headline.main"          
## [21] "response.docs.headline.kicker"        
## [22] "response.docs.headline.content_kicker"
## [23] "response.docs.headline.print_headline"
## [24] "response.docs.headline.name"          
## [25] "response.docs.headline.seo"           
## [26] "response.docs.headline.sub"           
## [27] "response.docs.byline.original"        
## [28] "response.docs.byline.person"          
## [29] "response.docs.byline.organization"    
## [30] "response.meta.hits"                   
## [31] "response.meta.offset"                 
## [32] "response.meta.time"
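To see what `rbind_pages()` does without calling the API, the sketch below combines two toy data frames that stand in for fetched pages. The column names and values are invented for illustration. Columns missing from a page are filled with `NA`.

```r
library(jsonlite)

# Toy stand-ins for two fetched pages (made-up data); the second page lacks
# `section` and adds a `kicker` column.
page_1 <- data.frame(headline = c("A", "B"),
                     section = c("Technology", "Science"),
                     stringsAsFactors = FALSE)
page_2 <- data.frame(headline = "C",
                     kicker = "Opinion",
                     stringsAsFactors = FALSE)

# Stack the pages row-wise, taking the union of their columns
combined <- rbind_pages(list(page_1, page_2))
print(combined)
```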
# Subsetting and filtering the data

ML_articles<- all_NYT_Articles %>% 
  filter(response.docs.document_type=="article") %>% 
  select(response.docs.uri,
         response.docs.source,
         response.docs.subsection_name,
         response.docs.type_of_material,
         response.docs.pub_date,
         response.docs.word_count)

# Data frame
datatable(ML_articles)
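The same filter-then-select pattern can be tried on a small toy data frame, which is useful when no API key is at hand. The rows below are invented for illustration; only the column names mirror the API output.

```r
library(dplyr)

# Made-up rows that reuse three of the API column names
toy <- data.frame(
  response.docs.document_type = c("article", "multimedia", "article"),
  response.docs.type_of_material = c("News", "Video", "Op-Ed"),
  response.docs.word_count = c(1200L, 0L, 850L),
  stringsAsFactors = FALSE
)

# Keep only rows typed as articles, then keep the columns of interest
toy_articles <- toy %>%
  filter(response.docs.document_type == "article") %>%
  select(response.docs.type_of_material, response.docs.word_count)

print(toy_articles)
```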

Analysis & Conclusion

Count of ML Articles

# Count the articles of each material type published by The New York Times
df_ml<- ML_articles %>% 
  filter(response.docs.source=='The New York Times') %>% 
  group_by(response.docs.type_of_material) %>% 
  summarise(Total.Count=n())

ggplot(data=df_ml, aes(x=reorder(response.docs.type_of_material,Total.Count), y=Total.Count)) +
  geom_bar(stat="identity", position=position_dodge())+
  xlab("Type of Material")+
  ylab("Number of Articles")+
  theme_minimal()
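Counting rows per group and summing word counts per group answer different questions, so it helps to compute both side by side. The sketch below does this on invented data; the values and the column names `n_articles` and `total_words` are illustrative assumptions.

```r
library(dplyr)

# Made-up articles: material type and word count
toy <- data.frame(
  response.docs.type_of_material = c("News", "News", "Op-Ed", "Review", "News"),
  response.docs.word_count = c(1200L, 900L, 850L, 600L, 1500L),
  stringsAsFactors = FALSE
)

# n() counts articles per type; sum() totals the words written per type
summary_tbl <- toy %>%
  group_by(response.docs.type_of_material) %>%
  summarise(n_articles = n(),
            total_words = sum(response.docs.word_count))

print(summary_tbl)
```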

Conclusion

Among the results sourced from The New York Times, articles outnumber the other types of material.