CS 688 - Term Project : Market Sentiment Analysis from Tweets

The 5-W Theory

What is Sentiment Analysis?

Sentiment plays an important role in human life. People express emotions in many different forms.

Since its beginning, social media has provided a convenient platform for expressing joy, excitement, anger, love, hate and so on.

According to Google Trends, the term “sentiment analysis” has been gaining steady traction over the past 5 years. Sentiment refers to the attitude expressed by an individual regarding a certain topic. The Google Trends query referenced below shows search interest in sentiment analysis of Twitter data.

Reference : https://trends.google.com/trends/explore?date=today%205-y&q=sentiment%20analysis%20twitter

Why sentiment analysis?

‘Sentiment Analysis’ of tweets is one approach to forecasting the stock market.

Below are a few examples of sentiments that may signal a potential gain for a stock.

A merger or acquisition is a sign of good growth; typical sentiments are ‘better future’, ‘more is good’, ‘hope for future’, etc.
The company publishes positive quarterly or annual results; typical sentiments are ‘great scope’, ‘a long journey’, ‘way to go’, etc.
The company’s leadership changes for the better; typical sentiments are ‘change is for good’, ‘new hope’, etc.
Government policies favor the company; typical sentiments are ‘government is for good’, ‘go go gov’, etc.

Below are a few examples of sentiments that may signal a potential loss for a stock.

The company loses a big contract; typical sentiments are ‘try next time’, ‘not capable’, etc.
The company publishes poor quarterly or annual results; typical sentiments are ‘keep an eye’, ‘a failed place’, ‘Boo’, etc.
The company loses assets, for example in an oil spill; typical sentiments are ‘lost trust’, ‘gods mercy’, etc.

Who uses sentiment analysis?

Researchers, psychiatrists and general readers use sentiment analysis. For our study, market predictors and capital investors may use sentiment analysis to predict the stock market.

How to do sentiment analysis?

An automated system reads Twitter’s social feed continuously. The tweets are processed, filtered and split into words. These words are compared against a predefined set of sentiment words labeled ‘Positive’ or ‘Negative’ and assigned a score.
A company whose social feed carries more positive sentiment tends to gain in the market in the near term, and one with more negative sentiment tends to lose.
We demonstrate this theory on a smaller scale by reading tweets about the current market’s gainers and losers.
We use R to build the automated system. I used R version 3.5.3 on the Windows 10 operating system.
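
The word-matching step described above can be illustrated with a minimal base-R sketch. The tiny lexicon and the score_tweet function are hypothetical and exist only for this example; the report itself uses the Bing Liu lexicon, and its own scoring code is not shown here.

```r
# Hypothetical miniature lexicon, for illustration only.
lexicon <- data.frame(
  word = c("good", "great", "hope", "lost", "failed", "boo"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Score one tweet: +1 for each positive word, -1 for each negative word.
score_tweet <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))  # split into words
  words <- words[nzchar(words)]                        # drop empty strings
  matched <- lexicon$sentiment[match(words, lexicon$word)]
  sum(matched == "positive", na.rm = TRUE) -
    sum(matched == "negative", na.rm = TRUE)
}

tweets <- c("Great results, good hope for the future",
            "Company lost the contract, a failed quarter")
scores <- vapply(tweets, score_tweet, integer(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
print(scores)
```

A positive total suggests a tweet leaning positive and a negative total suggests the opposite; words not found in the lexicon contribute nothing.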

Which stocks for sentiment analysis?

For simplicity, we take the top three gainers and the top three losers from the market.
A cashtag (a stock symbol preceded by the ‘$’ sign, like $MSFT, $PFE) is the best way to refer to tweets about a company. We use the cashtag as the primary search term for reading tweets.
For sampling we take 100 tweets (excluding retweets) per stock. If the cashtag does not return enough tweets, we use the company’s name to find more.
We include companies listed on NYSE, NASDAQ and AMEX, and use these exchanges’ listings to get company profiles.
Top gainers and top losers can be obtained from either http://finance.yahoo.com/ or http://www.google.com/finance .
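
The cashtag-first search with a company-name fallback might be sketched as follows. This is only an illustration under stated assumptions: find_tweets, search_fn and fake_search are hypothetical names, search_fn stands in for the real Twitter search call, and combining both result sets (rather than replacing one with the other) is an assumption about how the fallback works.

```r
# search_fn is a hypothetical stand-in for the real Twitter search call:
# it takes a query string and returns a character vector of tweet texts.
find_tweets <- function(symbol, company_name, search_fn, min_tweets = 100) {
  cashtag <- paste0("$", symbol)          # e.g. "$HELE"
  tweets <- search_fn(cashtag)
  if (length(tweets) < min_tweets) {
    # Not enough cashtag tweets: top up with a company-name search (assumption).
    tweets <- unique(c(tweets, search_fn(company_name)))
  }
  head(tweets, min_tweets)
}

# Fake search function for demonstration only: 40 cashtag hits, 80 name hits.
fake_search <- function(query) {
  n <- if (startsWith(query, "$")) 40 else 80
  paste(query, "tweet", seq_len(n))
}

result <- find_tweets("HELE", "Helen of Troy", fake_search)
length(result)
```

Passing the search function as an argument keeps the fallback logic testable without Twitter credentials.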

Project Solution

Below are the technical details of this analysis.

R Version 3.5.3
RStudio on a Windows 10 machine
R Libraries : ‘quantmod’,‘dplyr’,‘TTR’,‘rtweet’,‘tm’,‘wordcloud’,‘tidytext’,‘purrr’,‘ggplot2’,‘knitr’,‘tibble’ and ‘tidyr’
“CS688_termProject_JoshiR.rmd” was created for writing the R code and preparing this document.
For better organization, all custom functions are written in a separate functions.R file, and for privacy, the Twitter authentication details are written in a twitterAuth.R file. Both files are loaded into this RMD file using source.

Custom Functions

getTweetsFromHandle : Takes a Twitter search term and returns tweets from the Twitter API, using the provided Twitter authentication. It also removes links and special characters to produce the stripped text, and returns the original tweet object.
getTweetsFromCompany : Takes a companyInfo object and internally calls the ‘getTweetsFromHandle’ function. It first searches by cashtag and, if it does not find enough tweets, uses the company name as the search term.
getCorpusFromTweets : Takes tweet text and returns a corpus object. This function is described in more detail later in this document.
getProcessedCorpus : Takes a corpus, processes it and returns the processed corpus. It also removes predefined custom words from the text, like th, thi, thow, etc. This function is described in more detail later in this document.
getMostFrequentWords : Takes a list of words and returns the most frequent words with their frequencies.
sentiment_bing_from_words : Takes words and applies the bing sentiment lexicon to them. It returns the sentiment and its score.
wordCloudWithSentiments : Shows word clouds comparing negative and positive sentiments.

Pre Execution Steps

This code saves the searched tweets and the corpus data into separate data files (.RData). To load existing data instead of searching Twitter, set liveTweets <- FALSE in the code block below.
I picked today’s top gainers and losers from Yahoo Finance; you may change them if you wish.
A few custom words are defined for removal from the corpus; you may add more if you wish.

setwd("D:/PersonalStuff/Rushi/Study/MS_BU_MET/CS688/TermProject/")
liveTweets <- TRUE
startDate <- '2019-04-18'

todaysGainers <- c("HELE","RGEN","F")
todaysLosers <- c("INTC","TRQ","FII")
todaysTrending <- c("TSLA","FB","MSFT")
todaysMostActive <- c("SNAP","SIRI","AMD")
customWords = c("th","thi","thow")

Authentication to the Twitter account. (The code and output are hidden intentionally for privacy.)

For calculation and visualization, I get current stock information by calling the getQuote API.
For searching the Twitter API, we primarily use the cashtag (the stock symbol preceded by the $ sign, for example $INTC, $HELE, etc.).
Our goal is to get 100 tweets for each stock. Because the cashtag alone may not return enough tweets, we may also need to search Twitter using the company name or its Twitter handle.
We call the stockSymbols API from the TTR library to get company header information.
The table below shows a random sample of the company profiles available today.

kable(allstockData[round(runif(10, 50, 500)),], caption = "Company Profile for some random stocks with their Industry and exchange name")

Company Profile for some random stocks with their Industry and exchange name
	Symbol	Name	LastSale	MarketCap	IPOyear	Sector	Industry	Exchange
94	ERH	Wells Fargo Utilities and High Income Fund	12.9800	$120.14M	2004	NA	NA	AMEX
301	AAON	AAON, Inc.	50.2100	$2.61B	NA	Capital Goods	Industrial Machinery/Components	NASDAQ
66	CUO	Continental Materials Corporation	18.4931	$31.07M	NA	Capital Goods	Building Materials	AMEX
350	ADMS	Adamas Pharmaceuticals, Inc.	6.3200	$173.93M	2014	Health Care	Major Pharmaceuticals	NASDAQ
338	ACTG	Acacia Research Corporation	3.1800	$157.88M	NA	Miscellaneous	Multi-Sector Companies	NASDAQ
191	NGD	New Gold Inc.	0.8799	$509.56M	NA	Basic Industries	Precious Metals	AMEX
120	GLU-PB	The Gabelli Global Utility and Income Trust	52.4600	NA	NA	NA	NA	AMEX
382	AGMH	AGM Group Holdings Inc.	17.9400	$595.9M	2018	Technology	EDP Services	NASDAQ
85	EIM	Eaton Vance Municipal Bond Fund	12.4400	$843.95M	2002	NA	NA	AMEX
433	ALOT	AstroNova, Inc.	25.0200	$174.94M	1983	Technology	Computer peripheral equipment	NASDAQ

The tables below show today’s gainers and losers separately, with their current stock price (in USD).

kable(todaysGainers, caption = "Today's Gainers")

Today’s Gainers
Trade Time	Last	stock	Name	LastSale	MarketCap	IPOyear	Sector	Industry	Exchange
2019-04-30 16:00:01	144.00	HELE	Helen of Troy Limited	144.00	$3.69B	NA	Consumer Durables	Home Furnishings	NASDAQ
2019-04-30 16:00:01	67.38	RGEN	Repligen Corporation	67.38	$2.96B	1986	Health Care	Biotechnology: Biological Products (No Diagnostic Substances)	NASDAQ
2019-04-30 16:00:53	10.45	F	Ford Motor Company	10.45	$41.69B	NA	Capital Goods	Auto Manufacturing	NYSE

kable(todaysLosers, caption = "Today's Losers")

Today’s Losers
Trade Time	Last	stock	Name	LastSale	MarketCap	IPOyear	Sector	Industry	Exchange
2019-04-30 16:00:01	51.04	INTC	Intel Corporation	51.04	$228.51B	NA	Technology	Semiconductors	NASDAQ
2019-04-30 16:02:03	1.50	TRQ	Turquoise Hill Resources Ltd.	1.50	$3.02B	NA	Basic Industries	Precious Metals	NYSE
2019-04-30 16:04:17	30.73	FII	Federated Investors, Inc.	30.73	$3.1B	1998	Finance	Investment Managers	NYSE

Search Tweets

We searched for 100 tweets for each of the three stocks in the gainer set and in the loser set.

Search Tweets by Cashtag and/or Company Name

if(liveTweets == TRUE){

  #Get tweets for Gainers
    tweets.gainer1 = getTweetsFromCompany(todaysGainers[which(todaysGainers$stock == todaysGainers$stock[1]),])
    tweets.gainer2 = getTweetsFromCompany(todaysGainers[which(todaysGainers$stock == todaysGainers$stock[2]),])
    tweets.gainer3 = getTweetsFromCompany(todaysGainers[which(todaysGainers$stock == todaysGainers$stock[3]),])
  
  #Get tweets for Losers
    tweets.loser1 = getTweetsFromCompany(todaysLosers[which(todaysLosers$stock == todaysLosers$stock[1]),])
    tweets.loser2 = getTweetsFromCompany(todaysLosers[which(todaysLosers$stock == todaysLosers$stock[2]),])
    tweets.loser3 = getTweetsFromCompany(todaysLosers[which(todaysLosers$stock == todaysLosers$stock[3]),])
    
}

Combine all gainer and loser tweets into sets of 300 tweets

if(liveTweets == TRUE){
    #Report the number of tweets collected for the top three gainers
    paste("There are total of",  nrow(gainerTweets) , "tweets found for today's top 3 gainers")

    #Report the number of tweets collected for the top three losers
    paste("There are total of",  nrow(loserTweets) , "tweets found for today's top 3 losers")
}

## [1] "There are total of 300 tweets found for today's top 3 losers"

Create Data Corpora and Save as Data

A corpus is a collection of texts.
The getCorpusFromTweets function uses VCorpus to create a corpus from the provided text.
The function accepts multiple texts. We pass the gainer tweets and the loser tweets separately to generate two corpora.

if(liveTweets == TRUE){
    data.corpus1 <- getCorpusFromTweets(gainerTweets$stripped_text)
    data.corpus2 <- getCorpusFromTweets(loserTweets$stripped_text)
    
    head(data.corpus1)
    head(data.corpus2)
}

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6

If the liveTweets flag is set to TRUE, the system saves the corpus for each set into its own file.
Files are saved with today’s date in the name, which makes it possible to work on older tweets at a later date.

if(liveTweets == TRUE){
    save(data.corpus1, file=paste0("data/","stock_gainer_corpus_", Sys.Date(), ".RData"))
    save(data.corpus2, file=paste0("data/","stock_loser_corpus_", Sys.Date(), ".RData"))
}

If the liveTweets flag is set to FALSE, the system loads today’s saved corpus by default. Make sure a file with today’s date is available in the data folder.

if(liveTweets == FALSE){
    load(paste0("data/","stock_gainer_corpus_", Sys.Date(), ".RData"))
    load(paste0("data/","stock_loser_corpus_", Sys.Date(), ".RData"))
}
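
Because the file names embed the date, a corpus saved on an earlier day can be reloaded by passing that date explicitly. The sketch below illustrates the idea; corpus_path and load_corpus are hypothetical helpers, not part of the project’s functions.R, and the demonstration writes to a temporary directory instead of the data folder.

```r
# Build the file name used for a saved corpus (same naming pattern as above).
corpus_path <- function(kind, date = Sys.Date(), dir = "data") {
  file.path(dir, paste0("stock_", kind, "_corpus_",
                        format(as.Date(date)), ".RData"))
}

# Load a saved corpus for a given date, failing clearly if the file is missing.
load_corpus <- function(kind, date = Sys.Date(), dir = "data") {
  path <- corpus_path(kind, date, dir)
  if (!file.exists(path)) stop("No saved corpus found at ", path)
  env <- new.env()
  load(path, envir = env)   # restore saved objects into a private environment
  as.list(env)
}

# Demonstration in a temporary directory with a stand-in object.
demo_dir <- tempfile("corpus_demo")
dir.create(demo_dir)
data.corpus1 <- c("example tweet one", "example tweet two")
save(data.corpus1, file = corpus_path("gainer", "2019-04-30", demo_dir))
restored <- load_corpus("gainer", "2019-04-30", demo_dir)
names(restored)
```

Checking file.exists before load gives a clearer error than the default one when no file exists for the requested date.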

Processing the Corpus

An open, generic social feed contains a variety of data and is often packed with unwanted text such as http links, special characters and extra white space, which we need to exclude.
Numbers, frequently repeated words like ‘the’, ‘an’, ‘this’, etc., and punctuation also have to be excluded from the text.
We use the tm_map function from the tm library to process the text.
We created the ‘getProcessedCorpus’ function to process the corpus.

#process the corpus
gainer.processedCorpus <- getProcessedCorpus(data.corpus1)
loser.processedCorpus <- getProcessedCorpus(data.corpus2)
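
The body of getProcessedCorpus is not shown in this document. As a rough base-R approximation of the cleaning steps it is described as performing (not the tm_map implementation itself), the sketch below lowercases the text, strips links, numbers and punctuation, and drops stop words and custom words. The clean_text function and the short stop-word list are hypothetical.

```r
# Small hypothetical stop-word list; real stop-word lists are much longer.
stop_words <- c("the", "an", "a", "this", "is", "to", "for")
custom_words <- c("th", "thi", "thow")

clean_text <- function(text, drop_words) {
  text <- tolower(text)
  text <- gsub("https?://[^ ]+", " ", text)  # remove links
  text <- gsub("[^a-z ]", " ", text)         # remove numbers, punctuation, symbols
  words <- unlist(strsplit(text, " +"))      # split on runs of spaces
  words <- words[nzchar(words) & !(words %in% drop_words)]
  paste(words, collapse = " ")
}

cleaned <- clean_text("The $INTC results: down 9% today! https://example.com/x",
                      c(stop_words, custom_words))
print(cleaned)
```

The stemmed forms visible in the frequency tables later in this document (for example ‘turquois’ and ‘feder’) suggest that the real function also stems words, which this sketch does not attempt.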

Make a Document Term Matrix (DTM)

A document term matrix is a mathematical matrix that stores the frequency of each term (word) in each document of a collection.
The DTM holds word counts per document and serves as the input to techniques such as Latent Semantic Indexing.
Refer : https://medium.com/@camrongodbout/creating-a-search-engine-f2f429cab33c
We use the DocumentTermMatrix function from the tm library.
The code below generates a DTM for each set separately, keeping only words with a minimum length of 2.

#Below code creates DTM where the control parameter removes numbers and keeps words with a minimum length of 2 characters
#Optionally we can pass other parameters like bounds, which includes a word only if it appears in the specified number of documents
gainer.DTM <- DocumentTermMatrix(gainer.processedCorpus, control = list(
  removeNumbers = TRUE, #Remove numbers
  wordLengths=c(2,Inf)  # keep words with 2 or more characters
  #bounds=list(global=c(20,Inf))  # only include words in DTM if they appear in 20 or more documents
))

loser.DTM <- DocumentTermMatrix(loser.processedCorpus, control = list(
  removeNumbers = TRUE, #Remove numbers
  wordLengths=c(2,Inf)  # keep words with 2 or more characters
  #bounds=list(global=c(20,Inf))  # only include words in DTM if they appear in 20 or more documents
))

We found 1436 words in 324 documents in the gainers dataset.
We found 2356 words in 300 documents in the losers dataset.
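
To make the structure of a document term matrix concrete, here is a small base-R illustration built by hand from three made-up documents. It does not use the tm package and is not how the matrices above were produced; it only shows what the rows, columns and cells mean.

```r
# Three made-up documents, already cleaned and lowercased.
docs <- c(d1 = "intel stock down",
          d2 = "intel amd stock stock",
          d3 = "silver up")

tokens <- strsplit(docs, " ", fixed = TRUE)  # one character vector per document
terms <- sort(unique(unlist(tokens)))        # vocabulary = column names

# Count every vocabulary term in every document: one row per document.
dtm <- t(vapply(tokens, function(words) {
  as.integer(table(factor(words, levels = terms)))
}, integer(length(terms))))
colnames(dtm) <- terms

print(dtm)
```

Each row is a document, each column is a term, and each cell holds the number of times that term occurs in that document.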

WordCloud of Terms

The sheer number of words makes it impractical to analyze every one of them. We must first find the frequency of each word.
We created the getMostFrequentWords function to get the frequency of words.

loser.DTM <- as.matrix(loser.DTM) # Document term matrix
gainer.DTM <- as.matrix(gainer.DTM) # Document term matrix
gainer.wordFrequency <- colSums(gainer.DTM)
gainer.wordOrder <- order(gainer.wordFrequency, decreasing = TRUE) # Ordering the frequencies

loser.wordFrequency <- colSums(loser.DTM)
loser.wordOrder <- order(loser.wordFrequency, decreasing = TRUE) # Ordering the frequencies
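
Once the column sums are available, picking the most frequent words is a matter of sorting and truncating. The sketch below shows one way to do this on a small made-up matrix; get_top_words is a hypothetical helper and may differ from the project’s getMostFrequentWords.

```r
# Return the n most frequent terms of a document-term matrix, with their counts.
get_top_words <- function(dtm_matrix, n = 10) {
  freq <- colSums(dtm_matrix)            # total count of each term
  head(sort(freq, decreasing = TRUE), n) # highest counts first
}

# Made-up 3-document matrix for demonstration.
m <- matrix(c(1, 0, 2,
              0, 1, 1,
              3, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("stock", "join", "free")))

top2 <- get_top_words(m, 2)
print(top2)
```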

Word Frequency of Gainers Stocks vs Losers Stocks

Gainer Stocks: HELE, RGEN, F
Word	Frequency
stock	141
aapl	113
get	113
like	107
join	106
use	106
bo	102
free	102
make	102
robinhoodapp	102

Loser Stocks: INTC, TRQ, FII
Word	Frequency
turquois	88
feder	70
intel	53
intc	46
fii	35
wi	33
amd	24
aapl	23
stock	23
silver	22

Word Cloud from Tweets between Gainers vs Losers

Sentiment Analysis

Sentiment shows the mood of people. Sentiments are categorized as positive or negative: positive sentiments show a positive attitude and negative sentiments show a negative attitude.
We use the predefined sentiment words compiled by Bing Liu, categorized into 2006 positive words and 4783 negative words.

The plot below compares sentiment scores between gainers and losers.

The comparison above shows that the gainer stocks have more positive sentiment than the losers.

Sentiment Score Summary

The table below shows the number of sentiment scores found across all tweets; max and min show the highest and lowest cumulative scores.

Sentiment Summary Table
stock	Count	Mean	SD	max	min
Gainer	324	0.4104938	0.7090837	3	-3
Loser	300	0.0933333	1.0172207	5	-4
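
A summary table with these columns can be computed from per-tweet scores along the following lines. The scores below are made up for illustration, and the code is a sketch of the calculation, not the report’s own implementation.

```r
# Made-up per-tweet sentiment scores for the two groups.
scores <- data.frame(
  stock = c("Gainer", "Gainer", "Gainer", "Loser", "Loser", "Loser"),
  score = c(1, 0, 2, -1, 0, 1),
  stringsAsFactors = FALSE
)

# One summary row per group: count, mean, standard deviation, max and min.
per_group <- lapply(split(scores$score, scores$stock), function(s) {
  data.frame(Count = length(s), Mean = mean(s), SD = sd(s),
             max = max(s), min = min(s))
})
summary_table <- do.call(rbind, per_group)
summary_table <- data.frame(stock = rownames(summary_table), summary_table,
                            stringsAsFactors = FALSE)
rownames(summary_table) <- NULL

print(summary_table)
```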

Sentiment Comparison Cloud


Changes in stock in Recent Times

Stock Growth of all our symbols since 2019-04-18

Comparison of returns on $1 spent on our stocks on 2019-04-18
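
The growth of $1 invested on the start date is the running product of one plus each daily return, which equals each price divided by the first price. The sketch below shows the arithmetic on made-up closing prices; it is an illustration only and does not reproduce the report’s own calculation or data.

```r
# Hypothetical closing prices, oldest first (not real market data).
prices <- c(50, 52, 51, 55)

# Simple daily returns: (today's price - yesterday's price) / yesterday's price.
daily_returns <- diff(prices) / head(prices, -1)

# Value over time of $1 invested at the first price.
growth_of_dollar <- cumprod(c(1, 1 + daily_returns))

print(round(growth_of_dollar, 4))
```

A final value above 1 means the stock gained over the period, and a value below 1 means it lost.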

References

https://www.tidytextmining.com/sentiment.html