Ashutosh Kaddi
Vasishta Kuchipudi
Yashwanth Reddy Kariveda
Pankaj Shinde
12/08/2021
With every passing day, more individual investors are becoming active participants in the stock market. The GME and r/wallstreetbets episode shows how any system with humans at its centre is bound to be affected by emotion. Transactions are not as brutally rational and objective as economists tend to assume.
We understand that in the past it was virtually impossible to gauge and factor in human sentiment, because the means to do so were lacking. With the growing prominence of social media, and of media in general, world events now have an almost instantaneous effect on the market. The availability of big data and the direct participation of the masses in the economy make this a good opportunity to include the sentiment aspect of the market in modelling.
NLP (Sentiment Analysis)
After we clean and stage our data, we will use VADER to classify each article about a given company as POSITIVE, NEGATIVE or NEUTRAL. We will also try to factor in the polarity of the sentiment expressed by the article.
VADER returns four output features: one score for each sentiment (positive, negative and neutral) and a compound score, a normalised value from -1 to 1 that reflects the overall severity of the sentiment (-1 being very negative and 1 being very positive). Output data:
Positive
Negative
Neutral
Compound: Normalised value (-1 to 1)
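The classification step can be sketched as follows. This is a minimal illustration, not necessarily the exact procedure used in the project: it assumes the `vaderSentiment` Python package and the commonly used compound-score cut-offs of 0.05 and -0.05. The function names `label_from_compound` and `score_article` are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    The 0.05 threshold is the conventional cut-off and is an assumption here.
    """
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


def score_article(text):
    """Return VADER's four scores plus a label for one article text."""
    # Imported lazily so the labelling helper works without the package installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns a dict with the keys 'neg', 'neu', 'pos', 'compound'.
    scores = analyzer.polarity_scores(text)
    return {**scores, "label": label_from_compound(scores["compound"])}
```

Keeping the compound score next to the label preserves the polarity of the article, so a later model can use the numeric value instead of only the three-way class.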
Primary data: News articles about top Fortune 500 companies.
Secondary data: Historical stock data of Fortune 500 companies, fetched using ‘marketstack’, ‘newscatcher’, ‘mediastack’ or a similar API.
For cleaning, we removed some of the NA values. We fetched data from three different APIs, each of which returned the data in a different structure (nested lists). We merged all the data into a single flat structure. We also had to loop over the API responses to fetch all the records.
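The cleaning steps described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's actual code: `fetch_page(offset, limit)` is a hypothetical callable standing in for one API request, and the helper names `flatten`, `fetch_all` and `drop_missing` are made up for this example.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts and lists into a single-level dict."""
    flat = {}
    # Dicts contribute their keys, lists and tuples their positions.
    items = record.items() if isinstance(record, dict) else enumerate(record)
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, (dict, list, tuple)):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


def fetch_all(fetch_page, page_size=100):
    """Collect every record from a paginated source.

    fetch_page(offset, limit) is a hypothetical callable that returns
    a list of at most `limit` records starting at `offset`.
    """
    records, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:  # a short page means nothing is left
            break
        offset += page_size
    return records


def drop_missing(rows, required):
    """Keep only rows where every required column is present and not None."""
    return [row for row in rows if all(row.get(col) is not None for col in required)]
```

Flattening each API's records into the same single-level shape is what makes it possible to merge the three sources into one table before removing incomplete rows.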
The data that we are working with can be broadly divided into two sections:
News article data
This forms our base data. We fetch news articles using various APIs. Sentiment analysis is performed on the text of these articles. The important columns in the dataset are:
PublishedAt: Date when the article was published
Summary: Summary of the article
Source: Source of the article (NYT, WP)
Historical Stock Data
This dataset, along with the sentiment details from the article data, is the input for our prediction model.
Important features: Opening price, closing price, date, volume
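One way to combine the two datasets into model input is to average the article sentiment per day and attach it to the stock row for the same date. The sketch below is an assumption about how this could be done, not the project's confirmed method. It assumes that `PublishedAt` is an ISO 8601 string, that each article already carries a `compound` score, and that stock rows use the hypothetical keys `Date`, `Open`, `Close` and `Volume`.

```python
from collections import defaultdict
from statistics import mean


def daily_sentiment(articles):
    """Average the compound score of all articles published on the same day."""
    by_day = defaultdict(list)
    for article in articles:
        # Assumes ISO 8601 timestamps, so the first 10 characters are YYYY-MM-DD.
        day = article["PublishedAt"][:10]
        by_day[day].append(article["compound"])
    return {day: mean(scores) for day, scores in by_day.items()}


def build_model_input(stock_rows, articles, default=0.0):
    """Attach the daily mean sentiment to each stock row, matched on date.

    Days without any article get `default` (neutral) as their sentiment.
    """
    sentiment = daily_sentiment(articles)
    return [
        {**row, "sentiment": sentiment.get(row["Date"], default)}
        for row in stock_rows
    ]
```

Treating days without articles as neutral is a design choice made for this sketch; dropping those days or carrying the previous day's value forward are equally plausible alternatives.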
While we anticipated that this project would be complicated, the issues we came across made it quite a challenging task. The most difficult challenge was the data limit. We tried more than 12 APIs and still failed to get data beyond a couple of months on a free subscription.
As a future improvement, we can fetch more historical data and build our model using the ARIMA technique. We used news articles to ground our sentiment values in facts; enriching the dataset with data from social media sites such as Reddit and Twitter would give the sentiment values more depth.