In this project, I aim to analyze Twitter data in an effort to gauge sentiment towards the best and worst performing stocks.

Each trading day, certain stocks perform the best, where “best” means the stocks that experience the greatest percent change for the day according to Yahoo Finance.

On April 22nd, 2019, the three best performing stocks were:

  1. Medidata Solutions Inc., an American technology company that develops and markets software as a service for clinical trials. Its software suite includes tools for protocol development as well as tools for capturing patient data over the web.

  2. Rite Aid, the largest drugstore chain on the East Coast and the third largest in the U.S. The company ranked 94th on the 2019 Fortune 500 list by total revenue.

  3. Dropbox, a web-based file hosting service. Users upload files to Dropbox's cloud servers, where other users can access them from their own computers, so everyone works from the most up-to-date version.

Conversely, the three worst performing stocks, those that suffered the most negative percent change, were:

  1. GrafTech International, a manufacturer of graphite electrodes, which are essential to the metal-making process.

  2. Guardant Health, an oncology company that aims to help conquer cancer by providing data sets for advanced analytics.

  3. LexinFintech, a China-based company that operates an online consumer finance platform.

With these stocks in mind, the next step is to scrape Twitter for tweets that mention them. However, not every tweet references a stock by its full name; tweets often use abbreviations, nicknames, or the ticker symbol instead.

I decided to search for tweets via three options: cashtags, hashtags of the company name, and mentions of the company. I originally wanted to use only cashtags, since those would capture stock-market opinions on how each stock is doing, but I quickly found it difficult to scrape enough tweets that way. I therefore expanded the search to hashtags of the company name and mentions of the company, because users typically do not use cashtags when tweeting about a company; hashtags and mentions are far more common. Searching with hashtags and mentions also provides a better base for sentiment analysis, as people are more likely to use opinion-based words in those tweets.
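To make this concrete, the search-term vectors for each stock could look something like the sketch below. The cashtags use the ticker symbols listed by Yahoo Finance, while the hashtags and handles shown here are illustrative examples rather than the exact terms used.

#illustrative search terms: cashtag, company-name hashtag, and company mention for each stock
searchterms.mdso <- c('$MDSO', '#Medidata', '@Medidata')
searchterms.rad  <- c('$RAD', '#RiteAid', '@riteaid')
searchterms.dbx  <- c('$DBX', '#Dropbox', '@Dropbox')
searchterms.eaf  <- c('$EAF', '#GrafTech', '@GrafTech')
searchterms.gh   <- c('$GH', '#GuardantHealth', '@GuardantHealth')
searchterms.lx   <- c('$LX', '#LexinFintech', '@LexinFintech')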

Once I had a list of search terms for each stock, I scraped Twitter for tweets containing any of the search terms, limiting the results to the most recent tweets. I also wanted each stock to return at least 100 tweets, but given the obscurity of some of these companies, the sparsity of tweets meant that I had to extend the search period for certain stocks such as GrafTech International and Guardant Health.

library(rtweet)  #requires a Twitter API token

#search for recent English-language tweets matching each stock's search terms, excluding retweets
mdso.tweets <- search_tweets2(searchterms.mdso, n=150/length(searchterms.mdso), include_rts=FALSE, lang='en')
dbx.tweets <- search_tweets2(searchterms.dbx, n=130/length(searchterms.dbx), include_rts=FALSE, lang='en')
rad.tweets <- search_tweets2(searchterms.rad, n=110/length(searchterms.rad), include_rts=FALSE, lang='en')
eaf.tweets <- search_tweets2(searchterms.eaf, n=210/length(searchterms.eaf), include_rts=FALSE, lang='en')
gh.tweets <- search_tweets2(searchterms.gh, n=220/length(searchterms.gh), include_rts=FALSE, lang='en')
lx.tweets <- search_tweets2(searchterms.lx, n=220/length(searchterms.lx), include_rts=FALSE, lang='en')

The result of the scraping is six sets of tweets, one for each stock. We now have the dataset we will use to perform sentiment analysis and determine the overall feelings the twitterverse has for each company.

But wait! Tweets are written in informal English, and most of them contain slang, jargon, and extra characters. We still need to do some processing in order to end up with clean data. As always, the output is only as good as the input!

The first step of pre-processing is to add an identifier so we can easily tell which stock each tweet refers to. To do so, I decided to add the ticker symbol as a variable to each tweet with a function.

mdso.string <- c('MDSO')
rad.string <- c('RAD')
dbx.string <- c('DBX')
eaf.string <- c('EAF')
gh.string <- c('GH')
lx.string <- c('LX')

#keep the first 100 tweets and tag each one with its ticker symbol in a new projectid column
to.head <- function(x, z) {
  x <- head(x, 100)
  x$projectid <- z
  return(x)
}

The function takes the first 100 tweets in the list and adds a separately initialized string of the ticker symbol specific to each stock.

mdso.100 <- to.head(mdso.tweets, mdso.string)
rad.100 <- to.head(rad.tweets, rad.string)
dbx.100 <- to.head(dbx.tweets, dbx.string)
eaf.100 <- to.head(eaf.tweets, eaf.string)
gh.100 <- to.head(gh.tweets, gh.string)
lx.100 <- to.head(lx.tweets, lx.string)

I passed the two variables, the list of tweets and the ticker symbol string, through the function, so each tweet now has an additional projectid variable identifying which stock it references.
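A quick sanity check (illustrative, not part of the original workflow) confirms that each set of tweets now carries its ticker:

#each call should print a single ticker symbol with a count of 100
table(mdso.100$projectid)
table(eaf.100$projectid)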

#stack the tweets for the three gainers into one data set and the three losers into another
combined.gain <- rbind(mdso.100, rad.100, dbx.100)
combined.lose <- rbind(eaf.100, gh.100, lx.100)

library(tm)  #provides VCorpus, VectorSource, tm_map, etc.
to.corpus <- function(x) {
  #wrap the tweet text in a corpus object for language processing
  corpus.object <- VCorpus(VectorSource(x$text))
  return(corpus.object)
}

data.corpus1 <- to.corpus(combined.gain)
data.corpus2 <- to.corpus(combined.lose)

Once I had the tweets with an ID, the next step was to combine all of the tweets for the largest gaining stocks into one data set and the tweets for the largest losing stocks into another. Once they were combined, I wrote a function that takes the tweet text and returns a corpus object to make the language processing easier.

pre_process <- function(x) {
  #transformer that replaces a matched pattern with nothing (i.e., strips it out)
  toSpace <- content_transformer(function(x, pattern) gsub(pattern, '', x))
  corpus.clean <- tm_map(x, content_transformer(tolower))            #lowercase everything
  corpus.clean <- tm_map(corpus.clean, toSpace, 'http\\S+')          #remove hyperlinks
  corpus.clean <- tm_map(corpus.clean, toSpace, '[^\u0020-\u007F]+') #remove non-ASCII characters
  corpus.clean <- tm_map(corpus.clean, toSpace, '&\\S+')             #remove HTML entities such as &amp;
  corpus.clean <- tm_map(corpus.clean, toSpace, '@#')                #remove literal '@#' sequences
  corpus.clean <- tm_map(corpus.clean, toSpace, '\\d+')              #remove numbers
  corpus.clean <- tm_map(corpus.clean, removeWords, stopwords('english')) #remove English stop words
  corpus.clean <- tm_map(corpus.clean, removePunctuation)            #remove punctuation
  return(corpus.clean)
}

#run the cleaning function on each corpus
data_corpus1_clean <- pre_process(data.corpus1)
data_corpus2_clean <- pre_process(data.corpus2)

Phew, everything up until now was just to wrangle the data into workable datasets. The code above actually begins pre-processing the tweets so that sentiment analysis can run on meaningful words rather than symbols, foreign characters, URLs, etc.

To do this elegantly, I wrote a function so that each pre-processing step runs on the entire corpus. For example, tweets often contain pasted links to various websites. The first pre-processing step removes these hyperlinks, since they typically add no value to the tweet in terms of sentiment.
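For example, on a made-up tweet the hyperlink-removal pattern behaves like this:

#illustrative example of the URL-stripping step on a made-up tweet
gsub('http\\S+', '', 'Loving the new $DBX features! https://t.co/abc123')
## [1] "Loving the new $DBX features! "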

Once the pre-processing function is run on each corpus, the result is two corpora, each containing clean tweets. Finally, we are able to begin the data analysis and calculate the sentiment scores.

#dtm function: keep terms of at least 2 characters that appear in at least 5 documents
to.dtm <- function(x) {
  x.dtm <- DocumentTermMatrix(x, control = list(
    wordLengths=c(2, Inf),
    bounds=list(global=c(5, Inf))
  ))
  return(x.dtm)
}

dtm1 <- to.dtm(data_corpus1_clean)

dtm2 <- to.dtm(data_corpus2_clean)

The first step in the data analysis is to convert each corpus of tweets into a document-term matrix, which records how often each word appears across the set of tweets.

Once I have the word frequencies, I can determine the most common words, shown in the wordclouds below: one for the tweets about the biggest gainers and one for the tweets about the biggest losers.
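The frequencies and wordclouds can be produced from the document-term matrices along these lines (a minimal sketch; the freq1/freq2 names and wordcloud settings are illustrative, not the exact code used):

library(wordcloud)

#term frequencies, sorted from most to least common
freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)  #gainers
freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)  #losers
head(freq1, 10)
head(freq2, 10)

#wordclouds of the most frequent terms in each group
wordcloud(names(freq1), freq1, max.words = 50)
wordcloud(names(freq2), freq2, max.words = 50)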

##  riteaid  dropbox     life      aid     rite medidata     mdso    short 
##       68       66       64       62       62       52       47       38 
##  company      job 
##       35       35
##           eaf      graftech international        health            gh 
##            91            80            78            75            74 
##      guardant            lx  lexinfintech           ltd       fintech 
##            74            63            58            49            38

The wordclouds show us the most frequent words, but they give no context: we don't know whether those words were used positively or negatively.

Here is where sentiment analysis comes in.

library(dplyr)
library(tidytext)  #provides unnest_tokens(), stop_words and get_sentiments()

#sentiment score
sentiment_bing = function(twt){
  #Step 1;  perform basic text cleaning (on the tweet), as seen earlier
  twt_tbl = tibble(text = twt) %>% 
    mutate(
      # Remove http elements manually
      stripped_text = gsub("http\\S+","",text)
    ) %>% 
    unnest_tokens(word,stripped_text) %>% 
    anti_join(stop_words, by="word") %>%  #remove stop words
    inner_join(get_sentiments("bing"), by="word") %>% # merge with bing sentiment
    count(word, sentiment, sort = TRUE) %>% 
    ungroup() %>% 
    ## Create a column "score": each negative word contributes -1 times its count, each positive word +1 times its count
    mutate(
      score = case_when(
        sentiment == 'negative'~ n*(-1),
        sentiment == 'positive'~ n*1)
    )
  ## Calculate total score
  sent.score = case_when(
    nrow(twt_tbl)==0~0, # if there are no words, score is 0
    nrow(twt_tbl)>0~sum(twt_tbl$score) #otherwise, sum the positive and negatives
  )
  ## Keep track of which tweets contained no words at all from the bing list
  zero.type = case_when(
    nrow(twt_tbl)==0~"Type 1", # Type 1: no bing words at all, so a zero score is uninformative
    nrow(twt_tbl)>0~"Type 2" # Type 2: bing words present; a zero score means positives and negatives cancel out
  )
  list(score = sent.score, type = zero.type, twt_tbl = twt_tbl)
}

Using the Bing lexicon, I wrote a function that assigns each word within a tweet a positive or negative score. Once each word within a tweet is scored, the scores are summed, and the result is the overall sentiment score of that specific tweet. Applying the function to every tweet then yields an overall sentiment score for each one.
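For instance, the function can be called on a single made-up tweet, with $score pulling out just the summed sentiment:

#illustrative call on a made-up tweet; the function returns a list with score, type and twt_tbl
sentiment_bing("Great earnings report, very happy with this stock")$score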

gain <- c('gain')
lose <- c('lose')

gain_sent <- lapply(combined.gain$text, function(x) {sentiment_bing(x)})
lose_sent <- lapply(combined.lose$text, function(x) {sentiment_bing(x)})

library(purrr)  #for map() below

#combine those sentiment scores into one data frame
combined_sentiment = bind_rows(
  tibble(
    team = gain,
    score = unlist(map(gain_sent,'score')),
    type = unlist(map(gain_sent,'type'))
  ),
  tibble(
    team = lose,
    score = unlist(map(lose_sent,'score')),
    type = unlist(map(lose_sent,'type'))
  )
)

#summary statistics
combined_sentiment %>% #filter(type != 'Type 1') %>% 
  filter(!is.na(score)) %>%
  group_by(team) %>% 
  summarise(
    Count = n(),
    Mean = mean(score),
    SD = sd(score),
    Max = max(score),
    Min = min(score)
  )

The result shows that neither set of tweets, gainers or losers, carries an overtly positive or negative sentiment. The tweets about the stocks that gained the most were not as positive as one would expect, and likewise the tweets about the stocks that lost the most were not overtly negative.

The contributing factors behind this surprising result may lie in the dataset itself. The best and worst performing companies were relatively obscure to most people, which could have led to largely non-opinionated tweets. In addition, the Bing lexicon only measures whether a word is positive or negative. Other lexicons provide more detailed information: the AFINN lexicon, for example, scores words on a scale from -5 to +5 instead of Bing's binary positive/negative, and the NRC lexicon broadens the classification beyond positive and negative to emotion categories such as anger, fear, joy, and trust.
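Swapping lexicons would mostly be a matter of changing the inner_join in the scoring function. As a rough sketch (note that the AFINN and NRC lexicons are downloaded via the textdata package the first time they are requested):

#illustrative: inspect the alternative lexicons available through tidytext
afinn <- get_sentiments("afinn")  #columns: word, value (an integer from -5 to +5)
nrc   <- get_sentiments("nrc")    #columns: word, sentiment (anger, fear, joy, trust, ...)

#e.g. in sentiment_bing(), replace get_sentiments("bing") with get_sentiments("afinn")
#and sum the 'value' column instead of counting positive and negative words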

Along with the sentiment analysis, I decided to do a couple of additional analyses to take a look at stock performances.

The first analysis looked at the number of trades conducted on April 22nd, 2019. I found that the stocks in the gain group had a higher median number of trades than those in the lose group, which tells me there was more activity in the gain group on the day those stocks peaked.

library(ggplot2)
library(wesanderson)  #for the Darjeeling1 palette

numberOfTrades.df.gain <- rbind(mdso.day, rad.day, dbx.day)
ggplot(numberOfTrades.df.gain, aes(x=stock, y=numberOfTrades)) +
  geom_boxplot(aes(fill=stock), outlier.shape = NA) +
  scale_fill_manual(values=wes_palette(n=3, name='Darjeeling1')) +
  ylim(0, fivenum(numberOfTrades.df.gain$numberOfTrades)[4]+5) + #cap the y-axis just above the upper hinge so outliers don't dominate
  labs(title = 'Boxplot of Number of Trades per Stock-Gain')

numberOfTrades.df.lose <- rbind(eaf.day, gh.day, lx.day)
ggplot(numberOfTrades.df.lose, aes(x=stock, y=numberOfTrades)) +
  geom_boxplot(aes(fill=stock), outlier.shape = NA) +
  scale_fill_manual(values=wes_palette(n=3, name='Darjeeling1')) +
  ylim(0, fivenum(numberOfTrades.df.lose$numberOfTrades)[4]+5) +
  labs(title = 'Boxplot of Number of Trades per Stock-Lose')

The last analysis looked at the overall return of each stock. Because each stock closed at a vastly different price, raw prices were hard to compare directly; looking at overall return instead put each stock's performance on a relative scale, making comparisons within and across the groups much easier.

## [1] "MDSO" "RAD"  "DBX"  "EAF"  "GH"   "LX"

The stocks in the gain group all had an uptick in overall return right before the day I gathered the data. Conversely, the stocks in the lose group trended downward before that day.
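A rough sketch of how the overall-return comparison could be computed is shown below; the column names time and close are assumptions about the intraday data frames used earlier, and the return is measured relative to each stock's first observed price.

#illustrative sketch: cumulative return relative to each stock's first observation
library(dplyr)
library(ggplot2)

returns.df <- bind_rows(mdso.day, rad.day, dbx.day, eaf.day, gh.day, lx.day) %>%
  group_by(stock) %>%
  arrange(time, .by_group = TRUE) %>%                  #'time' column is assumed
  mutate(overallReturn = close / first(close) - 1) %>% #'close' column is assumed
  ungroup()

ggplot(returns.df, aes(x = time, y = overallReturn, colour = stock)) +
  geom_line() +
  labs(title = 'Overall Return per Stock', y = 'Return relative to first observation')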