Extracting content from diverse web resources, cleaning the raw data, preparing it for statistical analysis, and actually performing the analysis is far from a simple task. The New York Times web site provides a rich set of APIs, as described here: https://developer.nytimes.com/api . The task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame.
Preparing Data
Reproducibility
Workflow
library(DT)
library(jsonlite)
library(tidyjson)
library(dplyr)
library(tidyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(tm)
library(textclean)
library(lares)

# The API base URL and key are read from the lares credentials store
baseurl <- get_creds()$`nyt.api`$baseurl
apikey  <- get_creds()$`nyt.api`$apikey
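If the lares credential store is not available, one alternative (a sketch based on my own assumptions, not the author's setup) is to keep the key in an environment variable such as NYT_API_KEY and point baseurl at the Top Stories v2 endpoint:

# Alternative credential setup (assumption): only applies when NYT_API_KEY is set,
# e.g. via a line NYT_API_KEY=<your key> in ~/.Renviron
if (nzchar(Sys.getenv("NYT_API_KEY"))) {
  baseurl <- "https://api.nytimes.com/svc/topstories/v2/"
  apikey  <- Sys.getenv("NYT_API_KEY")
}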
get_data <- function(section) {
  # Build the request URL for the chosen section and parse the JSON response
  url <- paste(baseurl, section, ".json?api-key=", sep = "")
  request <- fromJSON(URLencode(paste0(url, apikey)))
  stories <- request$results
  # Keep a handful of useful fields from the returned stories
  newdata <- data.frame(Subsection = stories$subsection,
                        Title      = stories$title,
                        Abstract   = stories$abstract,
                        Byline     = stories$byline,
                        Created    = as.Date(stories$created_date),
                        'Short URL' = stories$short_url,
                        stringsAsFactors = FALSE)
  return(newdata)
}

get_wordcloud <- function(dataframe) {
  # Build a text corpus from the story abstracts and clean it up
  abstract <- dataframe$Abstract
  words <- Corpus(VectorSource(abstract))
  words <- words %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(content_transformer(replace_contraction)) %>%
    tm_map(content_transformer(replace_curly_quote)) %>%
    tm_map(stripWhitespace)
  words <- tm_map(words, content_transformer(tolower))
  words <- tm_map(words, removeWords, stopwords("english"))
  # Compute term frequencies across all abstracts
  dtm <- TermDocumentMatrix(words)
  matrix <- as.matrix(dtm)
  words <- sort(rowSums(matrix), decreasing = TRUE)
  df <- data.frame(word = names(words), freq = words)
  df <- df[-1, ]  # drop the first (most frequent) row
  set.seed(1234)  # for reproducibility
  wordcloud(words = df$word, freq = df$freq, min.freq = 1,
            max.words = 150, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}

These are the sections available to select from the Top Stories API: Automobiles, Books, Business, Health, Movies, Politics, Science, Sports, Technology, Travel, US, and World.
The API does not allow fetching top stories for all sections in a single request, so only a few selected sections are used for the data analysis.
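The calls below fetch each chosen section one at a time. As a sketch (not part of the original workflow), the same sections could be pulled in a single loop, with a short pause between requests to stay under the API's per-minute rate limit; the section vector and the 6-second pause are assumptions:

sections <- c("automobiles", "books", "sports", "health")
top_stories <- lapply(sections, function(s) {
  Sys.sleep(6)  # pause between calls to respect the per-minute rate limit
  get_data(s)
})
names(top_stories) <- sections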
df<-get_data("automobiles")datatable(df)get_wordcloud(df)df<-get_data("books")datatable(df)get_wordcloud(df)df<-get_data("sports")datatable(df)get_wordcloud(df)df<-get_data("health")datatable(df)get_wordcloud(df)