Team Members

Data

Our primary data will come from the News API: we will collect news articles related to the top Fortune 500 companies.

Secondary data: We plan to fetch historical stock price data for these companies using ‘marketstack’ or a similar API.

Prepare, Clean and Transform: The News API returns its data in JSON format. We plan to keep the relevant fields, such as author, title, description, publish date and content.

[Figure: tidy data variables]
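The field selection described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual pipeline. The sample payload is invented and only mimics the general shape of a News API response (an "articles" list whose items carry keys such as "author", "title", "description", "publishedAt" and "content"). The function name extract_fields and the extra "company" and "publisher" columns are assumptions added for illustration.

```python
import json

# Hypothetical sample shaped like a News API response; all values are invented.
SAMPLE_RESPONSE = """
{
  "status": "ok",
  "totalResults": 2,
  "articles": [
    {
      "source": {"id": null, "name": "Example Wire"},
      "author": "A. Reporter",
      "title": "Example Corp beats expectations",
      "description": "Quarterly results came in ahead of forecasts.",
      "url": "https://example.com/a",
      "publishedAt": "2021-03-01T14:30:00Z",
      "content": "Example Corp reported strong results..."
    },
    {
      "source": {"id": null, "name": "Example Daily"},
      "author": null,
      "title": "Example Corp faces lawsuit",
      "description": null,
      "url": "https://example.com/b",
      "publishedAt": "2021-03-02T09:00:00Z",
      "content": null
    }
  ]
}
"""

# Fields the proposal names as relevant.
FIELDS = ("author", "title", "description", "publishedAt", "content")


def extract_fields(raw_json, company):
    """Return one flat dict per article, keeping only the relevant fields."""
    payload = json.loads(raw_json)
    rows = []
    for article in payload.get("articles", []):
        row = {"company": company}
        # Assumption: the publisher name is kept for later per-publisher analysis.
        row["publisher"] = (article.get("source") or {}).get("name")
        for field in FIELDS:
            row[field] = article.get(field)  # missing keys become None
        rows.append(row)
    return rows


rows = extract_fields(SAMPLE_RESPONSE, "Example Corp")
for row in rows:
    print(row["publishedAt"], row["publisher"], row["title"])
```

Keeping missing values as None at this stage, rather than dropping them immediately, leaves the decision about what counts as an unusable article to the cleaning step.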

Once the data set is ready, we plan to use tidyverse libraries such as dplyr to clean and transform the data. This includes dropping null articles, converting the date column to a human-readable format, and pivoting the data set to a longer format.
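The three cleaning steps named above can be illustrated with a small sketch. The proposal intends to use tidyverse for this; the Python version below is only a standard-library stand-in to show the intended transformations. It assumes that a "null article" is one with a missing title or content, and that the publish date arrives as an ISO 8601 UTC timestamp. The sample rows and all function names are hypothetical.

```python
from datetime import datetime

# Hypothetical extracted rows; values are invented for illustration.
raw_rows = [
    {"company": "Example Corp", "publishedAt": "2021-03-01T14:30:00Z",
     "title": "Example Corp beats expectations", "content": "Strong results..."},
    {"company": "Example Corp", "publishedAt": "2021-03-02T09:00:00Z",
     "title": "Example Corp faces lawsuit", "content": None},
    {"company": "Example Corp", "publishedAt": "2021-03-03T11:15:00Z",
     "title": "Example Corp opens new plant", "content": "The new plant..."},
]


def drop_null_articles(rows):
    """Keep only rows with a title and content (assumed definition of non-null)."""
    return [r for r in rows if r.get("title") and r.get("content")]


def readable_date(iso_timestamp):
    """Convert '2021-03-01T14:30:00Z' to '2021-03-01'."""
    parsed = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y-%m-%d")


def pivot_longer(rows, id_cols, value_cols):
    """Turn each wide row into one long row per value column."""
    long_rows = []
    for r in rows:
        for col in value_cols:
            long_row = {k: r[k] for k in id_cols}
            long_row["field"] = col
            long_row["value"] = r[col]
            long_rows.append(long_row)
    return long_rows


clean = drop_null_articles(raw_rows)
for r in clean:
    r["publish_date"] = readable_date(r["publishedAt"])

long_rows = pivot_longer(clean, id_cols=("company", "publish_date"),
                         value_cols=("title", "content"))
for r in long_rows:
    print(r)
```

In the long format, every text field of an article becomes its own row, which makes it straightforward to score titles and bodies separately later on.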

Problem Description

Rational choice theory states that individuals use rational calculations to make rational choices and achieve outcomes that are aligned with their own personal objectives. These results are also associated with maximizing an individual’s self-interest.

We believe that rational choice theory is not absolute. The existence of ethos tells us that sentiments and emotions will always play a crucial role in any system that is based on human actions. With every passing day, more individual investors are becoming an active part of the stock market, so this seems an appropriate juncture to gauge the effect of sentiment on stock prices.

Analytics Plan

Accurately outlining the factors that affect stock prices, let alone predicting those prices, is not simple. The huge number of variables involved, coupled with human interdependency, makes the stock market a virtually chaotic system. But the stock market also tends to be forward-looking, which means that it reflects investors’ outlook on the economy.

NLP (Sentiment Analysis): After we clean and stage our data, we will use VADER to classify each article about a given company as POSITIVE, NEGATIVE or NEUTRAL. We will also try to factor in the polarity of the sentiment projected by the article. This can be done by refining a document and extracting keywords, which are then scored against a predefined lexicon containing polarities.
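To make the lexicon idea concrete, here is a toy scorer in Python. It is not VADER and does not use VADER's lexicon or rules; the tiny TOY_LEXICON, the tokenizer and the function names are all invented for illustration. The sketch sums the polarities of matched words, squashes the sum into the open interval (-1, 1) using the x / sqrt(x^2 + alpha) form with alpha = 15, and labels the result with cut-offs of +0.05 and -0.05, which are the thresholds commonly used with VADER's compound score. Treat these constants as assumptions to revisit.

```python
import math
import re

# Invented mini-lexicon: word -> polarity. A real lexicon is far larger.
TOY_LEXICON = {
    "strong": 2.0, "beats": 1.5, "growth": 1.5,
    "lawsuit": -2.0, "loss": -2.0, "weak": -1.5,
}


def polarity(text, lexicon=TOY_LEXICON, alpha=15.0):
    """Sum word polarities and normalize the total into (-1, 1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(lexicon.get(token, 0.0) for token in tokens)
    return raw / math.sqrt(raw * raw + alpha)


def label(score, threshold=0.05):
    """Map a normalized score to POSITIVE, NEGATIVE or NEUTRAL."""
    if score >= threshold:
        return "POSITIVE"
    if score <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


headlines = [
    "Example Corp beats forecasts on strong growth",
    "Example Corp faces lawsuit after weak quarter",
    "Example Corp holds annual meeting",
]
for headline in headlines:
    score = polarity(headline)
    print(f"{label(score):8s} {score:+.3f}  {headline}")
```

Keeping the continuous score alongside the three-way label preserves the polarity information that the later correlation step needs.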

We will then plot these sentiments against the stock price for the period in which each article was published and try to establish a correlation. Beyond this, we have thought of a number of other insights that can help in understanding the source data and its quirks, if any.
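One possible way to quantify the relationship described above is a Pearson correlation between the daily mean sentiment and the closing price on the same day. The sketch below is only an illustration under that assumption; the proposal does not fix a specific statistic, and all data values and function names here are made up. It uses plain Python so that it runs without extra packages.

```python
import math
from collections import defaultdict


def daily_mean_sentiment(scored_articles):
    """Average the sentiment scores of all articles published on each date."""
    buckets = defaultdict(list)
    for date, score in scored_articles:
        buckets[date].append(score)
    return {date: sum(scores) / len(scores) for date, scores in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented (date, sentiment score) pairs and closing prices for one company.
scored_articles = [
    ("2021-03-01", 0.6), ("2021-03-01", 0.2),
    ("2021-03-02", -0.5),
    ("2021-03-03", 0.1),
]
closing_prices = {
    "2021-03-01": 101.0, "2021-03-02": 98.5,
    "2021-03-03": 100.0, "2021-03-04": 100.5,
}

sentiment_by_day = daily_mean_sentiment(scored_articles)
# Only dates present in both series can be compared.
shared_dates = sorted(set(sentiment_by_day) & set(closing_prices))
sentiments = [sentiment_by_day[d] for d in shared_dates]
prices = [closing_prices[d] for d in shared_dates]

r = pearson(sentiments, prices)
print(f"dates compared: {len(shared_dates)}, Pearson r = {r:.3f}")
```

A same-day comparison is the simplest choice. Since markets may react with a delay, comparing sentiment with the next trading day's price or with daily returns would be a natural variation to try.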

Potential for other insights: understanding whether a publisher shows any inherent bias against a particular company, by comparing the overall sentiment with that of the individual publisher.
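The publisher comparison could be sketched as the difference between each publisher's mean sentiment for a company and the overall mean across all publishers. This is one simple reading of "bias", assumed here for illustration only. The publisher names, scores and the function name publisher_bias are hypothetical.

```python
from collections import defaultdict


def publisher_bias(scored_articles):
    """Return each publisher's mean sentiment minus the overall mean."""
    if not scored_articles:
        raise ValueError("no articles to compare")
    overall = sum(score for _, score in scored_articles) / len(scored_articles)
    by_publisher = defaultdict(list)
    for publisher, score in scored_articles:
        by_publisher[publisher].append(score)
    return {
        publisher: sum(scores) / len(scores) - overall
        for publisher, scores in by_publisher.items()
    }


# Invented (publisher, sentiment score) pairs for a single company.
scored_articles = [
    ("Example Wire", 0.5), ("Example Wire", 0.3),
    ("Example Daily", -0.4), ("Example Daily", -0.2),
    ("Example Post", 0.1),
]

bias = publisher_bias(scored_articles)
for publisher, offset in sorted(bias.items(), key=lambda item: item[1]):
    print(f"{publisher:14s} {offset:+.2f}")
```

A large negative offset would only hint at bias. With few articles per publisher, the difference may simply reflect which events that publisher happened to cover.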

Evaluation Plan

The media ecosystem as a whole can be considered the single most impactful factor in forming the sentiments of the ‘market’. Therefore, our evaluation of the study will be based on whether we were able to correlate a particular stock’s trend with the sentiment the company attracts in the news.

We will be cautious about some obvious limitations: for example, we can only collect articles for a single month, which is a very small sample.