The New York Times web site provides a rich set of APIs, described at https://developer.nytimes.com/apis. You’ll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame.
I created a developer account at the New York Times (NYT) website and generated an API key connected to my account. This key will be used to access data from the NYT site.
API.Key <- "dk5upqVbE5VhXzCZwacqGB5O17iAyGKW"  # personal NYT API key
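For reference, the analysis below relies on the following packages, all available on CRAN; loading them up front keeps the later chunks self-contained (stringr and cowplot are also loaded where they are first used):
library(jsonlite)  # fromJSON(), rbind_pages()
library(dplyr)     # select(), filter(), bind_rows(), the %>% pipe
library(tidytext)  # unnest_tokens(), get_sentiments(), stop_words
library(tm)        # stopwords("spanish")
library(ggplot2)   # plotting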
Using the Article Search API, I will download news article abstracts from the previous two years with the following search parameters:

Term       | Ecuador
Begin Date | 2017-10-20
End Date   | 2019-10-20
We can then build a search string, “MySearch.url”, to use with the NYT API.
term <- "ecuador"        # search term
begin_date <- "20171020" # begin date (YYYYMMDD)
end_date <- "20191020"   # end date (YYYYMMDD)
MySearch.url <- paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json?q=", term,
                       "&begin_date=", begin_date, "&end_date=", end_date,
                       "&facet_filter=true&api-key=", API.Key)
Every search string returns a different number of hits (matching articles). We first retrieve the hit count and use it to compute how many ten-result pages our chosen search string spans.
GetHits <- fromJSON(MySearch.url)  # first request: we only need the hit count
# pages are zero-indexed and hold 10 results each
PageNumber <- ceiling(GetHits$response$meta$hits[1] / 10) - 1
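As a quick sanity check of the paging arithmetic (the hit count here is hypothetical):
hits <- 474             # hypothetical: 474 matching articles
ceiling(hits / 10) - 1  # 47, i.e. pages 0 through 47 cover all results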
Then we can request each of the calculated search pages and load the results into a list.
pages <- list()
for(i in 0:PageNumber){
  nytSearch <- fromJSON(paste0(MySearch.url, "&page=", i), flatten = TRUE) %>% data.frame()
  #message("Retrieving page ", i)
  pages[[i+1]] <- nytSearch
  Sys.sleep(10)  # pause between requests to stay within the API rate limit
}
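The 10-second pause keeps us under the API’s rate limit. If a request fails mid-loop, however, the whole run is lost, so a more defensive variant could wrap the call in tryCatch. A minimal sketch (safe_fetch is a hypothetical helper, not part of the original code):
# hypothetical helper: back off and retry once if a request errors out
safe_fetch <- function(url) {
  tryCatch(fromJSON(url, flatten = TRUE),
           error = function(e) {
             Sys.sleep(30)  # wait out a transient failure, then retry once
             fromJSON(url, flatten = TRUE)
           })
}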
Combine the list of search results pages into a single data frame.
allNYTSearch <- rbind_pages(pages)
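jsonlite::rbind_pages is convenient here because it fills in columns that are missing from some pages; if every page is known to share the same flattened columns, the plain dplyr equivalent would also work:
# equivalent, assuming all pages have identical columns:
# allNYTSearch <- dplyr::bind_rows(pages)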
Let’s do some data cleanup: keep only the columns of interest and convert the publication date column from character to Date.
allNYTSearch$response.docs.pub_date <- as.POSIXct(allNYTSearch$response.docs.pub_date, format = "%Y-%m-%d")
short_df <- select(allNYTSearch, response.docs.abstract, response.docs.pub_date,
                   response.docs.word_count, response.docs.headline.main)
short_df$response.docs.pub_date <- as.Date(short_df$response.docs.pub_date)
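Note that the two-step conversion (character to POSIXct, then POSIXct to Date) could be collapsed into a single call on the raw character column; an equivalent alternative:
# same result in one step, applied before the select():
# allNYTSearch$response.docs.pub_date <- as.Date(allNYTSearch$response.docs.pub_date, format = "%Y-%m-%d")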
The abstracts are full of words that add nothing to our analysis, the so-called “stop words”. Since the articles concern a Spanish-speaking country, we also need to account for Spanish stop words.
data("stop_words")
spanish_stop_words <- bind_rows(stop_words,
                                tibble(word = tm::stopwords("spanish"),
                                       lexicon = "custom"))
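To see a few of the Spanish stop words tm supplies:
head(tm::stopwords("spanish"))  # common function words such as "de", "la", "que", ...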
Let’s divide the data set into two groups by year (2018 and 2019), filtering on the publication date with “between” and “as.Date”.
library(stringr)
NYT_2018 <- short_df %>% filter(between(response.docs.pub_date, as.Date("2018-01-01"), as.Date("2018-12-31")))
NYT_2019 <- short_df %>% filter(between(response.docs.pub_date, as.Date("2019-01-01"), as.Date("2019-10-15")))
dim(NYT_2018); dim(NYT_2019)
## [1] 275 4
## [1] 199 4
The following two chunks break the news abstracts into words, remove English and Spanish stop words, score each word against a sentiment lexicon, and finally order the results by the frequency at which the words appear.
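For reference, unnest_tokens() lowercases the text, strips punctuation, and produces one row per word; a tiny standalone example on a made-up abstract:
# minimal illustration of tokenization
tibble(txt = "Protests Erupted in Quito.") %>%
  unnest_tokens(word, txt)
# -> four rows: "protests", "erupted", "in", "quito"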
# get a list of words
NYT_Clean_2018 <- NYT_2018 %>%
  dplyr::select(response.docs.abstract) %>%
  unnest_tokens(word, response.docs.abstract) %>%
  anti_join(stop_words) %>%
  anti_join(spanish_stop_words) %>%
  filter(!word %in% c("del", "las", "la", "de", "tu", "noticias",
                      "en", "el", "lo", "esta", "sucediendo", "here's"))
## Joining, by = "word"
## Joining, by = "word"
NYT_Clean_2018 %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n))
## Selecting by n
# join sentiment classification to the abstract words
bing_word_counts_2018 <- NYT_Clean_2018 %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts_2018 <- bing_word_counts_2018 %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n))
## Selecting by n
# get a list of words
NYT_Clean_2019 <- NYT_2019 %>%
  dplyr::select(response.docs.abstract) %>%
  unnest_tokens(word, response.docs.abstract) %>%
  anti_join(stop_words) %>%
  anti_join(spanish_stop_words) %>%
  filter(!word %in% c("del", "las", "la", "de", "tu", "noticias",
                      "en", "el", "lo", "esta", "sucediendo", "here's"))
## Joining, by = "word"
## Joining, by = "word"
NYT_Clean_2019 %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n))
## Selecting by n
# join sentiment classification to the abstract words
bing_word_counts_2019 <- NYT_Clean_2019 %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts_2019 <- bing_word_counts_2019 %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n))
## Selecting by n
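Since the 2018 and 2019 pipelines are identical, the duplication could be factored into a helper function. A sketch built from the chunks above (the function name clean_and_count is ours, not part of the original code):
# hypothetical helper wrapping the cleanup + sentiment-count pipeline above
clean_and_count <- function(df) {
  df %>%
    dplyr::select(response.docs.abstract) %>%
    unnest_tokens(word, response.docs.abstract) %>%
    anti_join(stop_words, by = "word") %>%
    anti_join(spanish_stop_words, by = "word") %>%
    filter(!word %in% c("del", "las", "la", "de", "tu", "noticias",
                        "en", "el", "lo", "esta", "sucediendo", "here's")) %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(word, sentiment, sort = TRUE) %>%
    group_by(sentiment) %>%
    top_n(5) %>%
    ungroup() %>%
    mutate(word = reorder(word, n))
}
# bing_word_counts_2018 <- clean_and_count(NYT_2018)
# bing_word_counts_2019 <- clean_and_count(NYT_2019)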
Let’s plot side by side the sentiment analyses of the news article abstracts for 2018 vs. 2019, connected to the search term “ecuador”.
library(cowplot)
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
Y2019 <- ggplot(bing_word_counts_2019, aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Sentiment 2019.",
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
Y2018 <- ggplot(bing_word_counts_2018, aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Sentiment 2018.",
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
plot_grid(Y2018, Y2019, labels = "AUTO")
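To keep a copy of the combined figure, cowplot’s save_plot can write it to disk; for example (the filename and sizing arguments are arbitrary):
# save the two-panel figure; ncol = 2 matches the side-by-side layout
save_plot("ecuador_sentiment_2018_vs_2019.png",
          plot_grid(Y2018, Y2019, labels = "AUTO"), ncol = 2)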
We have shown the qualitative difference in sentiment in Ecuador news between 2018 and 2019. This matches the local context of the country: 2018 was a year when the immigration of undocumented migrants from Venezuela into Ecuador was a major concern, while in 2019 the country suffered a series of protests over the reduction of fuel subsidies, which hit the economically deprived segments of the Ecuadorian population.