The 5-W Theory

What is Sentiment Analysis?

Sentiment plays an important role in human life. People express their emotions in many forms.

Since its beginning, social media has provided a convenient platform for expressing joy, excitement, anger, love, hate and so on.

According to Google Trends, the term “sentiment analysis” has been gaining steady traction over the past 5 years. Sentiment refers to the attitude expressed by an individual regarding a certain topic. The Google Trends chart referenced below shows interest in sentiment analysis of Twitter tweets.

Reference : https://trends.google.com/trends/explore?date=today%205-y&q=sentiment%20analysis%20twitter

Why sentiment analysis?

Sentiment analysis of tweets is one approach to forecasting the stock market.

Below are a few of the sentiments that may signal a potential gain for a stock.

  • A company’s merger or acquisition is a sign of good growth; sentiments used include ‘better future’, ‘more is good’, ‘hope for future’, etc.
  • A company publishes positive quarterly or annual results; sentiments used include ‘great scope’, ‘a long journey’, ‘way to go’, etc.
  • A company’s leadership changes for the better; sentiments used include ‘change is for good’, ‘new hope’, etc.
  • Government policies favor the company; sentiments used include ‘government is for good’, ‘go go gov’, etc.

Below are a few of the sentiments that may signal a potential loss for a stock.

  • A company loses a big contract; sentiments used may include ‘try next time’, ‘not capable’, etc.
  • A company publishes poor quarterly or annual results; sentiments used may include ‘keep an eye’, ‘a failed place’, ‘Boo’, etc.
  • A company loses some assets, for example through an oil spill; sentiments used may include ‘lost trust’, ‘gods mercy’, etc.

Who uses sentiment analysis?

  • Researchers, psychiatrists and human readers use sentiment analysis in general. For our study, a market predictor or capital investor may use sentiment analysis to predict the stock market.

How to do sentiment analysis?

  • An automated system reads Twitter’s social feeds continuously. The tweets are processed, filtered and split into words. The words are compared against a pre-defined set of sentiment words labeled ‘Positive’ or ‘Negative’ and given a score.
  • A company whose social feed shows more positive sentiment tends to gain in the market in the near term, and one whose feed shows more negative sentiment tends to lose.
  • We demonstrate this theory on a smaller scale by reading tweets from Twitter for the current market’s gainers and losers.
  • We use R to build the automated system. I used R version 3.5.3 on the Windows 10 operating system.
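
The scoring step described above can be sketched in base R. This is only an illustration: the tiny lexicon, the example tweets and the function name scoreTweet are made up for this sketch, and a real run would use a full sentiment lexicon rather than seven words.

```r
# Hypothetical mini lexicon (assumption for illustration only).
lexicon <- data.frame(
  word  = c("good", "great", "hope", "gain", "lost", "failed", "bad"),
  score = c(1, 1, 1, 1, -1, -1, -1),
  stringsAsFactors = FALSE
)

# Score one tweet: lower-case it, split it into words,
# and sum the scores of the words found in the lexicon.
scoreTweet <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]
  sum(lexicon$score[match(words, lexicon$word)], na.rm = TRUE)
}

# Made-up example tweets.
tweets <- c("Great results, new hope for the future",
            "Company lost a big contract, a failed place")

scores <- vapply(tweets, scoreTweet, numeric(1),
                 lexicon = lexicon, USE.NAMES = FALSE)
print(scores)
```

A positive total suggests mostly positive wording and a negative total mostly negative wording; words missing from the lexicon contribute nothing.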

Which stocks for sentiment analysis?

  • For simplicity, we take the top three gainers and top three losers from the market.
  • A cashtag (a stock symbol preceded by a ‘$’ sign, like $MSFT, $PFE) is the best way to find tweets about a company. We use the cashtag as the primary search term for reading tweets.
  • For sampling we take 100 tweets (excluding retweets). We may not find enough tweets with the cashtag; in that case we use the company’s name to find tweets.
  • We include the NYSE, NASDAQ and AMEX markets and use them to get company profiles.
  • We can get Top Gainers and Top Losers stock details from either http://finance.yahoo.com/ or http://www.google.com/finance .
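
The cashtag rule and the company-name fallback can be sketched as two small helpers. The names makeCashtag and chooseSearchTerms and the cashtagCount argument are hypothetical; they only illustrate the decision and do not call the Twitter API.

```r
# Build a cashtag: a '$' sign followed by the upper-case stock symbol.
makeCashtag <- function(symbol) {
  paste0("$", toupper(symbol))
}

# Decide which search terms to use (hypothetical helper).
# cashtagCount is the number of tweets already found with the cashtag;
# if it is below the target, the company name is added as a fallback term.
chooseSearchTerms <- function(symbol, companyName, cashtagCount, target = 100) {
  terms <- makeCashtag(symbol)
  if (cashtagCount < target) {
    terms <- c(terms, companyName)
  }
  terms
}

print(makeCashtag("msft"))
print(chooseSearchTerms("F", "Ford Motor Company", cashtagCount = 40))
print(chooseSearchTerms("INTC", "Intel Corporation", cashtagCount = 100))
```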

Project Solution

Below are the technical aspects of this analysis.

Custom Functions

  • getTweetsFromHandle : Takes a Twitter search term and returns tweets from the Twitter API, using the provided Twitter authentication. It also removes links and special characters from the stripped text and returns the original tweet object.
  • getTweetsFromCompany : Takes a companyInfo object and internally calls the ‘getTweetsFromHandle’ function. It first searches by cashtag and, if it does not find enough tweets, uses the company name as the search term.
  • getCorpusFromTweets : Takes tweet text and returns a corpus object. We will see details about this function later in this document.
  • getProcessedCorpus : Takes a corpus, processes it and returns the processed corpus. It also removes pre-defined custom words from the text, like th, thi, thow, etc. We will see details about this function later in this document.
  • getMostFrequentWords : Takes a list of words and returns the most frequent words with their frequencies.
  • sentiment_bing_from_words : Takes words and applies Bing sentiments to them. It returns each sentiment and its score.
  • wordCloudWithSentiments : Shows word clouds specifically for Negative vs Positive sentiments.
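
The functions themselves are not listed in this report. As a rough idea of the link and special-character removal mentioned for getTweetsFromHandle, here is a minimal base R sketch. The function name stripTweetText and the choice to keep the ‘$’ sign (so cashtags survive) are assumptions, not the actual implementation.

```r
# Hypothetical cleaning helper (not the report's own function).
stripTweetText <- function(text) {
  # Remove links such as http://... or https://...
  text <- gsub("https?://[^[:space:]]+", " ", text)
  # Replace anything that is not a letter, digit, '$' or space.
  text <- gsub("[^A-Za-z0-9$ ]", " ", text)
  # Collapse repeated whitespace and trim both ends.
  text <- gsub("[[:space:]]+", " ", text)
  trimws(text)
}

cleaned <- stripTweetText("Big day for $INTC!!! https://t.co/abc123 #earnings")
print(cleaned)
```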

Pre Execution Steps

  • This code saves the searched tweets and the corpus data into separate data files (.RData). If you wish to load existing data, set liveTweets <- FALSE in the code block below.
  • From Yahoo Finance, I picked today’s top gainers and losers; you may change them if you wish.
  • A few custom words are defined for removal from the corpus; you may provide more if you wish.
setwd("D:/PersonalStuff/Rushi/Study/MS_BU_MET/CS688/TermProject/")
liveTweets <- TRUE
startDate <- '2019-04-18'

todaysGainers <- c("HELE","RGEN","F")
todaysLosers <- c("INTC","TRQ","FII")
todaysTrending <- c("TSLA","FB","MSFT")
todaysMostActive <- c("SNAP","SIRI","AMD")
customWords = c("th","thi","thow")

Authentication to the Twitter account. (I intentionally hid the code and output to keep the credentials private.)

  • For calculation and visualization, I get current stock information by calling the getQuote API.
  • For searching the Twitter API, we primarily use the cashtag (the search term preceded by a $ sign, for example $INTC, $HELE, etc.).
  • Our goal is to get 100 tweets for a single stock. We may not get enough tweets from the cashtag alone, so we might need to search Twitter using the company name or the respective Twitter handle.
  • We call the stockSymbols API from the TTR library to get company header information.

  • The table below shows today’s company profile available for our stocks.

kable(allstockData[round(runif(10, 50, 500)),], caption = "Company Profile for some random stocks with their Industry and exchange name")
Company Profile for some random stocks with their Industry and exchange name

|     | Symbol | Name | LastSale | MarketCap | IPOyear | Sector | Industry | Exchange |
|-----|--------|------|----------|-----------|---------|--------|----------|----------|
| 94  | ERH    | Wells Fargo Utilities and High Income Fund | 12.9800 | $120.14M | 2004 | NA | NA | AMEX |
| 301 | AAON   | AAON, Inc. | 50.2100 | $2.61B | NA | Capital Goods | Industrial Machinery/Components | NASDAQ |
| 66  | CUO    | Continental Materials Corporation | 18.4931 | $31.07M | NA | Capital Goods | Building Materials | AMEX |
| 350 | ADMS   | Adamas Pharmaceuticals, Inc. | 6.3200 | $173.93M | 2014 | Health Care | Major Pharmaceuticals | NASDAQ |
| 338 | ACTG   | Acacia Research Corporation | 3.1800 | $157.88M | NA | Miscellaneous | Multi-Sector Companies | NASDAQ |
| 191 | NGD    | New Gold Inc. | 0.8799 | $509.56M | NA | Basic Industries | Precious Metals | AMEX |
| 120 | GLU-PB | The Gabelli Global Utility and Income Trust | 52.4600 | NA | NA | NA | NA | AMEX |
| 382 | AGMH   | AGM Group Holdings Inc. | 17.9400 | $595.9M | 2018 | Technology | EDP Services | NASDAQ |
| 85  | EIM    | Eaton Vance Municipal Bond Fund | 12.4400 | $843.95M | 2002 | NA | NA | AMEX |
| 433 | ALOT   | AstroNova, Inc. | 25.0200 | $174.94M | 1983 | Technology | Computer peripheral equipment | NASDAQ |

  • The tables below show the information split into today’s gainers and losers, with their current stock price (in USD).
kable(todaysGainers, caption = "Today's Gainers")
Today’s Gainers

| Trade Time | Last | stock | Name | LastSale | MarketCap | IPOyear | Sector | Industry | Exchange |
|------------|------|-------|------|----------|-----------|---------|--------|----------|----------|
| 2019-04-30 16:00:01 | 144.00 | HELE | Helen of Troy Limited | 144.00 | $3.69B | NA | Consumer Durables | Home Furnishings | NASDAQ |
| 2019-04-30 16:00:01 | 67.38 | RGEN | Repligen Corporation | 67.38 | $2.96B | 1986 | Health Care | Biotechnology: Biological Products (No Diagnostic Substances) | NASDAQ |
| 2019-04-30 16:00:53 | 10.45 | F | Ford Motor Company | 10.45 | $41.69B | NA | Capital Goods | Auto Manufacturing | NYSE |

kable(todaysLosers, caption = "Today's Loser")
Today’s Loser

| Trade Time | Last | stock | Name | LastSale | MarketCap | IPOyear | Sector | Industry | Exchange |
|------------|------|-------|------|----------|-----------|---------|--------|----------|----------|
| 2019-04-30 16:00:01 | 51.04 | INTC | Intel Corporation | 51.04 | $228.51B | NA | Technology | Semiconductors | NASDAQ |
| 2019-04-30 16:02:03 | 1.50 | TRQ | Turquoise Hill Resources Ltd. | 1.50 | $3.02B | NA | Basic Industries | Precious Metals | NYSE |
| 2019-04-30 16:04:17 | 30.73 | FII | Federated Investors, Inc. | 30.73 | $3.1B | 1998 | Finance | Investment Managers | NYSE |

Search Tweets

We searched for 100 tweets associated with each of the three stocks in the gainer set and in the loser set.

Search Tweets by Cashtag and/or Company Name

if(liveTweets == TRUE){

    #Get tweets for gainers (each call receives the company-profile row for one stock)
    tweets.gainer1 <- getTweetsFromCompany(todaysGainers[which(todaysGainers$stock == todaysGainers$stock[1]),])
    tweets.gainer2 <- getTweetsFromCompany(todaysGainers[which(todaysGainers$stock == todaysGainers$stock[2]),])
    tweets.gainer3 <- getTweetsFromCompany(todaysGainers[which(todaysGainers$stock == todaysGainers$stock[3]),])

    #Get tweets for losers
    tweets.loser1 <- getTweetsFromCompany(todaysLosers[which(todaysLosers$stock == todaysLosers$stock[1]),])
    tweets.loser2 <- getTweetsFromCompany(todaysLosers[which(todaysLosers$stock == todaysLosers$stock[2]),])
    tweets.loser3 <- getTweetsFromCompany(todaysLosers[which(todaysLosers$stock == todaysLosers$stock[3]),])

}

Combine all gainer and loser tweets into sets of 300 tweets

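if(liveTweets == TRUE){
    #Assumption: the combining step is not shown in this report. The three result
    #sets per group are presumably row-bound into gainerTweets and loserTweets, e.g.
    #gainerTweets <- rbind(tweets.gainer1, tweets.gainer2, tweets.gainer3)
    #loserTweets <- rbind(tweets.loser1, tweets.loser2, tweets.loser3)

    #Report the size of the combined set for the top three gainers
    paste("There are total of",  nrow(gainerTweets) , "tweets found for today's top 3 gainers")

    #Report the size of the combined set for the top three losers
    paste("There are total of",  nrow(loserTweets) , "tweets found for today's top 3 losers")
}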
## [1] "There are total of 300 tweets found for today's top 3 losers"

Create Data Corpora and Save as Data



if(liveTweets == TRUE){
    data.corpus1 <- getCorpusFromTweets(gainerTweets$stripped_text)
    data.corpus2 <- getCorpusFromTweets(loserTweets$stripped_text)
    
    head(data.corpus1)
    head(data.corpus2)
}
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6
if(liveTweets == TRUE){
    save(data.corpus1, file=paste0("data/","stock_gainer_corpus_", Sys.Date(), ".RData"))
    save(data.corpus2, file=paste0("data/","stock_loser_corpus_", Sys.Date(), ".RData"))
}
if(liveTweets == FALSE){
    load(paste0("data/","stock_gainer_corpus_", Sys.Date(), ".RData"))
    load(paste0("data/","stock_loser_corpus_", Sys.Date(), ".RData"))
}

Processing the Corpus



#process the corpus
gainer.processedCorpus <- getProcessedCorpus(data.corpus1)
loser.processedCorpus <- getProcessedCorpus(data.corpus2)

Make a Document Term Matrix (DTM)



#The code below creates the DTM with control parameters that remove numbers and keep words of at least 2 characters
#Optionally we can pass other parameters like bounds, which keeps a word only if it appears in the specified number of documents
gainer.DTM <- DocumentTermMatrix(gainer.processedCorpus, control = list(
  removeNumbers = TRUE, #Remove numbers
  wordLengths=c(2,Inf)  # keep words with 2 or more characters
  #bounds=list(global=c(20,Inf))  # only include words in DTM if they happen in 20 or more documents
))

loser.DTM <- DocumentTermMatrix(loser.processedCorpus, control = list(
  removeNumbers = TRUE, #Remove numbers
  wordLengths=c(2,Inf)  # keep words with 2 or more characters
  #bounds=list(global=c(20,Inf))  # only include words in DTM if they happen in 20 or more documents
))

WordCloud of Terms

loser.DTM <- as.matrix(loser.DTM) # Document term matrix
gainer.DTM <- as.matrix(gainer.DTM) # Document term matrix
gainer.wordFrequency <- colSums(gainer.DTM)
gainer.wordOrder <- order(gainer.wordFrequency, decreasing = TRUE) # Ordering the frequencies

loser.wordFrequency <- colSums(loser.DTM)
loser.wordOrder <- order(loser.wordFrequency, decreasing = TRUE) # Ordering the frequencies
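
To make the colSums and order steps above concrete, here is a small base R sketch on a hand-built document-term matrix. The three toy documents are made up, and the matrix is built with table() instead of DocumentTermMatrix so the example runs without extra packages.

```r
# Made-up tokenized documents (one character vector per document).
docs <- list(c("stock", "gain", "stock"),
             c("gain", "stock"),
             c("stock", "free"))

# Build a documents x terms count matrix by hand.
terms <- sort(unique(unlist(docs)))
dtm <- t(vapply(docs,
                function(d) as.numeric(table(factor(d, levels = terms))),
                numeric(length(terms))))
colnames(dtm) <- terms

# Same steps as in the report: total count per term, then sort descending.
wordFrequency <- colSums(dtm)
wordOrder <- order(wordFrequency, decreasing = TRUE)
topWords <- wordFrequency[wordOrder][1:2]
print(topWords)
```

Each column sum is the number of times a term occurs across all documents, so indexing the frequencies by the ordering gives the most frequent terms first.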

Word Frequency of Gainers Stocks vs Losers Stocks

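Gainer Stocks: HELE, RGEN, F

| Word | Frequency |
|------|-----------|
| stock | 141 |
| aapl | 113 |
| get | 113 |
| like | 107 |
| join | 106 |
| use | 106 |
| bo | 102 |
| free | 102 |
| make | 102 |
| robinhoodapp | 102 |

Loser Stocks: INTC, TRQ, FII

| Word | Frequency |
|------|-----------|
| turquois | 88 |
| feder | 70 |
| intel | 53 |
| intc | 46 |
| fii | 35 |
| wi | 33 |
| amd | 24 |
| aapl | 23 |
| stock | 23 |
| silver | 22 |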

Word Cloud from Tweets between Gainers vs Losers


Sentiment Analysis

The plot below compares the sentiment scores of the gainers with those of the losers.

Sentiment Score Summary

Sentiment Summary Table

| stock | Count | Mean | SD | max | min |
|-------|-------|------|----|-----|-----|
| Gainer | 324 | 0.4104938 | 0.7090837 | 3 | -3 |
| Loser | 300 | 0.0933333 | 1.0172207 | 5 | -4 |
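
A summary table of this shape can be computed per group in base R, as in the sketch below. The scores are made-up values for illustration, not the report's data, and describeScores is a hypothetical helper name.

```r
# Made-up sentiment scores for two groups (illustration only).
scores <- data.frame(
  stock = c("Gainer", "Gainer", "Gainer", "Loser", "Loser", "Loser"),
  score = c(1, 2, -1, -2, 1, -1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the five statistics shown in the summary table.
describeScores <- function(x) {
  c(Count = length(x), Mean = mean(x), SD = sd(x), max = max(x), min = min(x))
}

# Split the scores by group and stack one summary row per group.
summaryTable <- do.call(rbind, lapply(split(scores$score, scores$stock), describeScores))
print(summaryTable)
```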

Sentiment Comparison Cloud



Changes in stock in Recent Times

Stock Growth of all our symbols since 2019-04-18

Comparison of returns on $1 spent on our stocks on 2019-04-18
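
The value of $1 invested on the start date can be computed from a price series as sketched below. The prices are hypothetical, not the actual quotes for our stocks.

```r
# Hypothetical closing prices, starting on the investment date.
prices <- c(50, 52, 51, 55)

# Value of $1 invested at the first price: each price relative to the first.
growthOfDollar <- prices / prices[1]

# Equivalent view: compound the day-over-day returns.
dailyReturns <- diff(prices) / head(prices, -1)
growthFromReturns <- cumprod(c(1, 1 + dailyReturns))

print(growthOfDollar)
```

Dividing by the first price puts every stock on the same $1 starting point, which makes stocks with very different prices directly comparable.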



Thank You