NY Times Sentiment Analysis of Sears Corporation

Business Case

News is integral to Fitch’s core businesses. Analysts scour and track the news to stay informed and to add perspective to Fitch’s ratings, research, and other commentary. Despite this importance, Fitch does not have a tool that lets analysts track or analyze news related to the industries and entities they cover.

When fully developed, our tool would address this gap in two ways. First, it would extract articles related to a particular entity or sector. Analysts conduct news searches as part of the ratings process and research, and the tool could facilitate that step. Second, the sentiment analysis would provide an easier way to derive context from the news. Given the vast amount of information available, it is often hard to get a sense of how news has trended over long periods, so the tool’s visualization of sentiment scores would help users identify trends. A comparison feature would extend the analysis by letting the user compare multiple entities or sectors with one another.

Ultimately, the tool would benefit the company in two ways: it would streamline processes that Fitch employees carry out day to day, and it would provide a broad analysis that many users would find valuable.

API Calls

We considered various sources and techniques for pulling a set of news articles related to Sears Corporation over a set time period (January 1, 2006 to November 1, 2017). The code below uses the NY Times Article Search API.

################################# NY Times SEARCH API ####################################

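#packages used throughout the script
library(httr)          #GET, content, stop_for_status
library(jsonlite)      #fromJSON
library(dplyr)         #%>%, filter, mutate, count, inner_join
library(tidyr)         #unnest, spread
library(tidytext)      #get_sentiments
library(data.table)    #setDT, :=
library(ggplot2)       #ggplot
library(wordcloud)     #wordcloud
library(RColorBrewer)  #brewer.pal

#Page 1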
nyTimes <- GET("http://api.nytimes.com/svc/search/v2/articlesearch.json?query=sears&fq=news_desk=Business&begin_date=20060201&end_date=20130101&page=1&api-key=66dee3f85fdc49f48c39adbb932ad0d1")

#function to parse the JSON body of a response into an R object
json_parse <- function(req) {
  text <- content(req, as = "text", encoding = "UTF-8")
  if (identical(text, "")) warning("No output to parse.")
  fromJSON(text)
}

#parse GET response
json_nyTimes <- json_parse(nyTimes)

#capture meta data
results <- json_nyTimes$response$docs

#subset for four parameters: url, snippet, source, publication date
results <- as.data.frame(results)
results.sub <- subset(results, select = c(web_url, snippet, source, pub_date))


#automatically run GET function on multiple pages and aggregate in one dataframe
baseURL <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?query=sears&begin_date=20060201&end_date=20171101&page="
key ="&api-key=66dee3f85fdc49f48c39adbb932ad0d1"

for(i in 2:500){
  
  page = i
  finalURL <- paste0(baseURL, page, key)
  
  nyTimes_test <- GET(finalURL) %>% stop_for_status()
  
  json_nyTimes_test <- json_parse(nyTimes_test)
  results_test <- json_nyTimes_test$response$docs
  results_test <- as.data.frame(results_test)
  results_test.sub <- subset(results_test, select = c(web_url, snippet, source, pub_date))
  
  results.sub <- rbind(results.sub, results_test.sub)
  
  #pause for 30 seconds
  Sys.sleep(30)
  
}
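
The paging loop above hard-codes 500 pages and grows the data frame with rbind on every pass. Below is a minimal sketch of an alternative that stops at the first empty page and binds the results once at the end. The helper `collect_pages` and the stand-in `fake_fetch` are hypothetical names that are not part of the original script; in practice the fetch function would wrap the GET and json_parse calls shown above.

```r
# Hypothetical helper: call fetch_page(p) for each page number and stop
# as soon as a page comes back empty, then bind all pages together once.
collect_pages <- function(fetch_page, pages) {
  collected <- list()
  for (p in pages) {
    docs <- fetch_page(p)
    if (is.null(docs) || nrow(docs) == 0) break  # no more results
    collected[[length(collected) + 1]] <- docs
  }
  do.call(rbind, collected)
}

# Stand-in for the real API call (assumption): three pages of one article
# each, then empty pages.
fake_fetch <- function(p) {
  if (p > 3) {
    return(data.frame(web_url = character(0), snippet = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(web_url = paste0("https://example.com/", p),
             snippet = paste("snippet", p),
             stringsAsFactors = FALSE)
}

demo <- collect_pages(fake_fetch, 1:10)
print(demo)
```

Passing the fetch function as an argument keeps the paging logic testable without network access or an API key.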

Cleaning the Data

Before scoring, we removed duplicate articles, kept only articles whose URLs fall under the Business, Opinion, or U.S. sections, and dropped articles with empty snippets.

#remove duplicate articles
results.sub <- unique(results.sub)

#filter for articles whose URL falls under the Business, Opinion, or U.S. sections
results.sears <- filter(results.sub, grepl('/business/|/opinion/|/us', web_url))

#remove articles with empty snippets
results.sears2 <- results.sears[!(results.sears$snippet == ""),]

#isolate snippets
snippets <- results.sears2$snippet
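
To make the effect of the three cleaning steps concrete, here is a small sketch in base R on made-up data. The `toy` data frame and its URLs are illustrative assumptions, not rows from our dataset.

```r
# Toy input (assumed data): one duplicate, one off-topic section,
# and one article with an empty snippet.
toy <- data.frame(
  web_url = c("https://www.nytimes.com/2017/01/05/business/a.html",
              "https://www.nytimes.com/2017/01/05/business/a.html",
              "https://www.nytimes.com/2016/03/02/sports/b.html",
              "https://www.nytimes.com/2015/07/09/opinion/c.html"),
  snippet = c("Sears plans to close stores.",
              "Sears plans to close stores.",
              "A game recap.",
              ""),
  stringsAsFactors = FALSE
)

# 1. remove exact duplicate rows
deduped <- unique(toy)

# 2. keep rows whose URL matches one of the section patterns
in_section <- deduped[grepl("/business/|/opinion/|/us", deduped$web_url), ]

# 3. drop rows with an empty snippet
cleaned <- in_section[in_section$snippet != "", ]

print(cleaned)
```

Note that the pattern `/us` matches any URL path segment beginning with "us", so a stricter pattern such as `/us/` may be preferable if only the U.S. section is intended.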

Transform dataset into Tidyverse format

#split each snippet into individual words (one row per word)
words <- results.sears2 %>%
  mutate(word = strsplit(as.character(snippet), " ")) %>%
  unnest(word)

#Bing lexicon (one row per word, labeled positive or negative)
bing <- get_sentiments("bing")

#count positive and negative lexicon matches for each word within each snippet,
#then take the net score (positive minus negative)
words.bing1 <- words %>%
  inner_join(bing, by = "word") %>%
  count(word, snippet, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(score = positive - negative)

#create subset of snippets and dates
dates <- unique(subset(results.sears2, select = c(snippet, pub_date)))

#merge dates with words.bing1 (has sentiment score per word within each snippet)
sentimentScores <- merge(words.bing1, dates, by = "snippet")
sentimentScores <- unique(sentimentScores)

#add a Date column parsed from the publication timestamp
setDT(sentimentScores)
sentimentScores[, Date := as.Date(pub_date)]
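
The scoring idea (count positive matches, subtract negative matches) can be sketched without any packages. The example below uses a hypothetical four-word lexicon in place of the Bing lexicon, and a hypothetical helper `score_snippet`. It also lower-cases the text and strips punctuation before matching, which is an assumption on our part: splitting on spaces alone leaves tokens such as "Loss" or "bankrupt." that would not match the lower-case entries of a lexicon.

```r
# Hypothetical mini lexicon standing in for the Bing lexicon
lexicon <- data.frame(
  word = c("gain", "strong", "loss", "bankrupt"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Net sentiment of one snippet: positive matches minus negative matches
score_snippet <- function(snippet, lexicon) {
  # lower-case and strip punctuation so tokens can match lexicon entries
  cleaned <- gsub("[[:punct:]]", "", tolower(snippet))
  tokens <- strsplit(cleaned, "\\s+")[[1]]
  matched <- lexicon$sentiment[match(tokens, lexicon$word)]
  sum(matched == "positive", na.rm = TRUE) -
    sum(matched == "negative", na.rm = TRUE)
}

score_snippet("Sears reports a strong gain.", lexicon)         # 2
score_snippet("Loss widens; Sears may go bankrupt.", lexicon)  # -2
```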

Visualizing Sentiment Trends

Net sentiment score per article snippet in our dataset, by month of publication:

#subset for just scores and date
scores <- subset(sentimentScores, select = c(score, Date))
setDT(scores)[, Date := format(as.Date(Date), "%Y/%m")]

ggplot(scores, aes(Date, score)) +
  geom_bar(stat = "identity", show.legend = FALSE, colour = "purple") +
  xlab("Date") + ylab("Sentiment by Article") +
  ggtitle("Net Sentiment Score by article") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
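
Because the plot groups scores by month, it can help to compute a single net value per month before plotting. The following is a minimal base R sketch on made-up scores; the data and the `monthly` name are illustrative assumptions.

```r
# Toy scores (assumed data): two in January 2017, one in February 2017
scores_demo <- data.frame(
  Date = as.Date(c("2017-01-05", "2017-01-20", "2017-02-03")),
  score = c(1, -2, 3)
)

# Label each row with its year/month, then sum the scores within each month
scores_demo$month <- format(scores_demo$Date, "%Y/%m")
monthly <- aggregate(score ~ month, data = scores_demo, FUN = sum)

print(monthly)
```

The resulting data frame has one row per month and could be passed to ggplot in place of the per-snippet scores.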

#word cloud of the sentiment-bearing words, sized by frequency
words_frequency <- as.data.frame(table(sentimentScores$word))

wordcloud(words = words_frequency$Var1, freq = words_frequency$Freq, min.freq = 1,
          max.words=80, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Next Steps

Our project would be useful for Fitch’s workflow because it would automate tasks that analysts currently perform manually. Moreover, the sentiment analysis would provide useful insight and commentary for both ratings and research.

We would need to do a number of things to ready our project for development. First, we would need to incorporate additional news sources into the tool. More importantly, we would need to refine our sentiment analysis. We realized that the analysis would most likely need to be tailored by news source, given that writing style and vocabulary can vary greatly from one source to another. We would also need to incorporate a weighting of sentiment, so that the scores are calculated relative to one another. Finally, we would like to create a dashboard that shows prominent news stories and Fitch’s rating history alongside the sentiment score time series.
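
One possible reading of the weighting step, offered only as a sketch, is to divide the net score by the number of matched words so that long and short snippets become comparable. The function name `normalized_score` and the choice of formula are our assumptions and have not been validated against the data.

```r
# Hypothetical weighting: net score divided by total matched words,
# giving a value between -1 (all negative) and 1 (all positive).
normalized_score <- function(positive, negative) {
  total <- positive + negative
  # snippets with no matched words get a neutral score of 0
  ifelse(total == 0, 0, (positive - negative) / total)
}

normalized_score(c(3, 1, 0), c(1, 4, 0))  # 0.5, -0.6, 0
```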