Business Case

News is integral to Fitch’s core businesses. Analysts monitor the news to stay informed and to bring perspective to Fitch’s ratings, research, and other commentary. Despite this, Fitch does not have a tool that lets analysts track or analyze news on the industries and entities they cover.

When fully developed, our tool would address this gap. First, it would extract articles related to a particular entity or sector. Analysts already run news searches as part of the ratings process and research, and the tool would speed up that step. Second, sentiment analysis would make it easier to derive context from the news. Given the volume of information available, it is often hard to tell how coverage has trended over long periods, so the tool’s visualization of sentiment scores would help users identify trends. A comparison feature would extend the analysis by letting the user compare multiple entities or sectors with one another.

Ultimately, the tool would benefit the company in two ways: it would streamline processes that Fitch employees carry out day to day, and it would provide a broad analysis that many users would find valuable.

API Calls

We considered various sources and techniques to pull a set of news articles related to Sears Corporation over a set time period (January 1, 2006 to November 1, 2017). The code below queries the New York Times Article Search API.

################################# NY Times SEARCH API ####################################

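#load packages used throughout
library(httr)       #GET, content, stop_for_status
library(jsonlite)   #fromJSON
library(dplyr)      #%>%, filter, mutate, count
library(tidyr)      #unnest, spread
library(tidytext)   #get_sentiments
library(data.table) #setDT, :=

#Page 1
nyTimes <- GET("http://api.nytimes.com/svc/search/v2/articlesearch.json?query=sears&fq=news_desk=Business&begin_date=20060201&end_date=20130101&page=1&api-key=66dee3f85fdc49f48c39adbb932ad0d1")

#function to parse the JSON response body into an R list
json_parse <- function(req) {
  text <- content(req, as = "text", encoding = "UTF-8")
  if (identical(text, "")) warning("No output to parse.")
  fromJSON(text)
}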

#parse GET response
json_nyTimes <- json_parse(nyTimes)

#capture meta data
results <- json_nyTimes$response$docs

#subset for four parameters: url, snippet, source, publication date
results <- as.data.frame(results)
results.sub <- subset(results, select = c(web_url, snippet, source, pub_date))


#automatically run GET function on multiple pages and aggregate in one dataframe
baseURL <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?query=sears&begin_date=20060201&end_date=20171101&page="
key ="&api-key=66dee3f85fdc49f48c39adbb932ad0d1"

for(i in 2:500){
  
  page = i
  finalURL <- paste0(baseURL, page, key)
  
  nyTimes_test <- GET(finalURL) %>% stop_for_status()
  
  json_nyTimes_test <- json_parse(nyTimes_test)
  results_test <- json_nyTimes_test$response$docs
  results_test <- as.data.frame(results_test)
  results_test.sub <- subset(results_test, select = c(web_url, snippet, source, pub_date))
  
  results.sub <- rbind(results.sub, results_test.sub)
  
  #pause for 30 seconds
  Sys.sleep(30)
  
}
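
The paging loop above can also be written with the URL construction and the row-binding pulled into small helpers. The sketch below is only an illustration under stated assumptions, not the code used for the project: `build_page_url` and `bind_pages` are hypothetical helper names, `YOUR_KEY` is a placeholder, and the pages are invented data frames standing in for parsed API responses, so nothing here calls the network.

```r
# Hypothetical helper: build the request URL for one results page.
build_page_url <- function(base_url, page, api_key) {
  paste0(base_url, page, "&api-key=", api_key)
}

# Hypothetical helper: keep the four columns of interest from each page
# and stack all pages into one data frame with a single rbind call.
bind_pages <- function(pages) {
  keep <- c("web_url", "snippet", "source", "pub_date")
  trimmed <- lapply(pages, function(p) p[, keep, drop = FALSE])
  do.call(rbind, trimmed)
}

base_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?query=sears&begin_date=20060201&end_date=20171101&page="
urls <- vapply(2:4, function(i) build_page_url(base_url, i, "YOUR_KEY"), character(1))

# Invented stand-ins for two parsed pages (each has one extra column)
page_a <- data.frame(web_url = "u1", snippet = "s1", source = "NYT",
                     pub_date = "2017-01-01", extra = 1, stringsAsFactors = FALSE)
page_b <- data.frame(web_url = "u2", snippet = "s2", source = "NYT",
                     pub_date = "2017-01-02", extra = 2, stringsAsFactors = FALSE)
all_pages <- bind_pages(list(page_a, page_b))
```

Collecting the pages in a list and binding them once avoids growing the data frame on every iteration, which matters more as the number of pages increases.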

Cleaning the Data

Before scoring, we removed duplicate articles, kept only articles whose URLs fall under the Business, Opinion, or U.S. sections, and dropped articles with empty snippets.

#remove duplicate articles
results.sub <- unique(results.sub)

#filter for articles whose URL falls under the /business/, /opinion/ or /us (U.S.) sections
results.sears <- filter(results.sub, grepl('/business/|/opinion/|/us', web_url))

#remove articles with empty snippets
results.sears2 <- results.sears[!(results.sears$snippet == ""),]

#isolate snippets
snippets <- results.sears2$snippet
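
To make the effect of each cleaning step concrete, here is a small base R sketch on invented data. The URLs and snippets are made up for illustration, and `toy` is a hypothetical stand-in for `results.sub`.

```r
# Invented rows standing in for results.sub
toy <- data.frame(
  web_url = c("https://www.nytimes.com/2017/01/05/business/a.html",
              "https://www.nytimes.com/2017/01/05/business/a.html",
              "https://www.nytimes.com/2017/02/01/sports/b.html",
              "https://www.nytimes.com/2017/03/09/opinion/c.html"),
  snippet = c("Sears plans to close stores.", "Sears plans to close stores.",
              "A game recap.", ""),
  stringsAsFactors = FALSE
)

# Step 1: drop exact duplicate rows
toy <- unique(toy)

# Step 2: keep rows whose URL matches one of the section patterns
toy <- toy[grepl("/business/|/opinion/|/us", toy$web_url), ]

# Step 3: drop rows with empty or missing snippets
toy <- toy[!is.na(toy$snippet) & toy$snippet != "", ]
```

Only the first business article survives all three steps. Because the `/us` pattern has no trailing slash, it would also match any other path segment that begins with those letters.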

Transform dataset into Tidyverse format

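#assumption: test2 is the cleaned data frame from the previous step
test2 <- results.sears2

#split each snippet into individual words (one row per word)
words <- test2 %>% 
  mutate(word = strsplit(as.character(snippet), " ")) %>% 
  unnest(word)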

#Bing lexicon; get_sentiments() avoids relying on the lexicon/score columns of older tidytext versions
bing <- get_sentiments("bing")

#count positive and negative lexicon matches, keyed by word and snippet,
#and score each row as positive minus negative
snippets_ensentiment3 <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, index = snippet, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
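
The scoring idea above (positive matches minus negative matches) can be checked by hand on a tiny example. The following base R sketch is an illustration only: the four-word lexicon, the snippets, and the helper `score_snippet` are all hypothetical, and the real Bing lexicon is far larger.

```r
# Hypothetical mini lexicon standing in for the Bing lexicon
lexicon <- data.frame(
  word = c("strong", "gain", "loss", "bankrupt"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Invented snippets for illustration
snips <- c("Strong sales gain at retailer",
           "Retailer reports loss and may go bankrupt",
           "Retailer opens store")

# Hypothetical helper: count of positive words minus count of negative words.
score_snippet <- function(text, lexicon) {
  # lower-case and split on anything that is not a letter
  tokens <- tolower(strsplit(text, "[^A-Za-z]+")[[1]])
  hits <- lexicon$sentiment[match(tokens, lexicon$word)]
  as.numeric(sum(hits == "positive", na.rm = TRUE) -
             sum(hits == "negative", na.rm = TRUE))
}

scores <- vapply(snips, score_snippet, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
```

The sketch lower-cases tokens and strips punctuation before matching. Splitting on single spaces alone, as in the script, leaves capitalized or punctuated tokens such as "Strong" or "stores." unmatched against a lower-case lexicon.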

#merge with BING lexicon
words.bing <- merge(bing, words, by ="word")

#create subset of snippets and dates
dates <- subset(words.bing, select = c(snippet, pub_date))

names(words.bing1) <- c("index", "snippet")

#merge dates with words.bing 1 (has sentiment score per snippet)
sentimentScores <- merge(words.bing1, dates, by = "snippet")
sentimentScores <- unique(sentimentScores)

colnames(sentimentScores) <- c("snippet", "word", "negative", "positive", "dummy", "score", "pub_date")

#convert to a data.table by reference, then derive a Date column from pub_date
setDT(sentimentScores)
sentimentScores[, Date := as.Date(pub_date)]
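
Once each score has a date, the long-run trend mentioned in the business case can be summarized by averaging scores per month. The sketch below uses base R on invented values. It assumes `pub_date` arrives as an ISO-style timestamp string, and the `toy` and `monthly` names are hypothetical.

```r
# Invented scores with timestamps in the assumed pub_date format
toy <- data.frame(
  pub_date = c("2017-01-05T10:00:00+0000", "2017-01-20T08:30:00+0000",
               "2017-02-03T15:45:00+0000"),
  score = c(2, -1, -3),
  stringsAsFactors = FALSE
)

# Take the leading YYYY-MM-DD characters and convert them to a Date
toy$Date <- as.Date(substr(toy$pub_date, 1, 10))

# Label each row with its year and month
toy$month <- format(toy$Date, "%Y-%m")

# Mean sentiment score per month
monthly <- aggregate(score ~ month, data = toy, FUN = mean)
```

Here `monthly` has one row per month, which is the shape a trend line or a side-by-side comparison of entities would need.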