Introduction

Correlation of Twitter Sentiment with Stock Price for Anthem, Inc

A large number of consumers turn to social media to express concerns and feedback about company performance. It may be possible to link the sentiment of these items to daily changes in stock prices.

Anthem, Inc is the second largest health insurance provider (by membership) in the United States and the largest Blue Cross Blue Shield member organization. For this project I will perform a sentiment analysis on tweets about Anthem, Inc and examine the relationship between sentiment and stock prices.

Data Used & Process

Data for this project will come from two sources - tweets about Anthem and daily stock prices from Yahoo.com. Tweets will be collected using the twitteR package which uses the Twitter API to search and return tweets. Stock prices will be downloaded as a CSV and loaded into R directly.

Consumers use many different Twitter accounts and hashtags to reach Anthem. The ones used for this analysis are shown below:

@AskAnthem @AnthemInc @ThinkAnthem @CareMoreHealth @Amerigroup @AskBCBSGa @AnthemBusiness
@AskEmpire @empirebcbs #AnthemBCBS #bcbsAnthem

Once tweets have been gathered the above accounts and hashtags a sentiment analysis will be performed to determine the number of positive and negative words contained in each tweet. The resulting score will indicate the overall sentiment of the tweet.

Once sentiment scores are created a simple linear regression analysis will be performed to determine the relative associate between daily sentiment (daily percent of tweets that are positive) with daily change in stock price (opening price - closing price).

Required Packages and Working Directory

require("twitteR")

## Loading required package: twitteR

require("wordcloud")

## Loading required package: wordcloud

## Loading required package: RColorBrewer

require("tm")

## Loading required package: tm

## Loading required package: NLP

require("dplyr")

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:twitteR':
## 
##     id, location

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require("ggplot2")

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

setwd("~/Desktop/IS607/Data-607/Final Project")

Collection of Tweets

The twitteR package allows for direct connection to the Twitter API. Once connected a user can search twitter, return entire user timelines and perform other functions on Twitter. Before using the package a application must be set up in your Twitter user account. Once set up you will receive a set of codes that can be used to connect the package to Twitter.

## [1] "Using direct authentication"

Once connected we used the searchTwitter function to search twitter for tweets containing the accounts or hashtags from the above. One limitation of the Twitter API is that searching only allows about 7 days worth of tweets to be returned.

tweet_list <- searchTwitter("@AskAnthem OR @AnthemInc OR @ThinkAnthem OR
                            @CareMoreHealth OR @Amerigroup OR @AskBCBSGa OR @AnthemBusiness  
                            OR @AskEmpire OR @empirebcbs OR #AnthemBCBS OR #bcbsAnthem"
                            , n = 1000 , lang  = "en", since = "2016-12-09")

## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 886

The returned tweets are initially stored in a list. For this project need them in a data frame and in text format.

#Convert list to dataframe
tweets.df <- twListToDF(tweet_list)

#removing duplicate tweets (retweets) from dataframe
tweets.nodups.df <- distinct(tweets.df, text, .keep_all = TRUE)

#clean up dataframe a bit 
tweets.nodups.df$text <- gsub('…', '', tweets.nodups.df$text) #remove ... at end of tweets

tweets.nodups.df <- plyr::rename(tweets.nodups.df, c("created" = "Date")) #rename created to Date 

tweets.nodups.df$Date <- as.Date(tweets.nodups.df$Date) #convert from datetime to date format

#create text list with tweets for sentiment analysis
tweets_text <- lapply(tweet_list, function(x) x$getText())

#fix Mac encoding issue with (from stack overflow)
tweets_text <- sapply(tweets_text,function(row) iconv(row, "latin1", "ASCII", sub=""))

#removing duplicate tweets (retweets) from list
tweets_nodups_text <- unique(tweets_text)

Exploring the tweets

Now that we have the data in the formats we need it in for analysis we can do some preliminary exploration. First we will examine a word cloud of the tweets to see the types of words that make up the returned tweets.

#Create tweet corpus
r_stats_text_corpus <- Corpus(VectorSource(tweets_nodups_text))

#Clean up corpus in prepartion for word cloud
r_stats_text_corpus <- tm_map(r_stats_text_corpus,
                              content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
                              mc.cores=1) #Encoding corrections for Mac
r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(tolower), mc.cores=1) #Transform all text to lower case
r_stats_text_corpus <- tm_map(r_stats_text_corpus, removePunctuation, mc.cores=1) #remove all punctuation 
r_stats_text_corpus <- tm_map(r_stats_text_corpus, function(x)removeWords(x,stopwords()), mc.cores=1) #remove all stop words

#Create color word cloud 
wordcloud(r_stats_text_corpus, min.freq = 10, max.words = 150, colors=brewer.pal(8, "Dark2"))

Sentiment Analysis

For our sentiment analysis we will use a function created that uses a published lexicon of positive and negative words. This function will create a score for each tweet. A score of 0 indicates the tweet is neutral. A score of 1 or more indicates the tweet is positive. A score of -1 or less indicates the tweet is negative. The higher (or lower) the number indicates the relative strength of the sentiment (based on the count of words).

#' Code below from github - see appenix/reference below for more information
#' score.sentiment() implements a very simple algorithm to estimate
#' sentiment, assigning a integer score by subtracting the number 
#' of occurrences of negative words from that of positive words.
#' 
#' @param sentences vector of text to score
#' @param pos.words vector of words of postive sentiment
#' @param neg.words vector of words of negative sentiment
#' @param .progress passed to <code>laply()</code> to control of progress bar.
#' @returnType data.frame
#' @return data.frame of text and corresponding sentiment scores
#' @author Jefrey Breen <jbreen@cambridge.aero>
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  
  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

#The positive and negative words lexicons are stored in a local director
#Please see appendix/reference for more information on origin
hu.liu.pos = scan('positive-words.txt', what = 'character', comment.char = ';')
hu.liu.neg = scan('negative-words.txt', what = 'character', comment.char = ';')

#Here we add some additional words that were discovered from initial review of tweets
pos.words <- c(hu.liu.pos)
neg.words <- c(hu.liu.neg, 'wait', 'waiting', 'hold', 'onhold' , 'on hold', 'asshole', 'cancel','spam', 'spams', 'cancel', 'wtf')

Once the sentiment function and negative and positive lexicons have been created the function can be ran on the text file we created earlier to score each tweet.

#run the sentiment function on the text of the tweets 
anthem.scores <- score.sentiment(tweets_nodups_text, pos.words, neg.words, .progress='none')

## Loading required package: plyr

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:twitteR':
## 
##     id

## Loading required package: stringr

#merge the results back with the original file 
anthem.score.merge <- merge(anthem.scores, tweets.nodups.df, by = 'text')

Next we will explore the results of the sentiment analysis before proceeding with correlation analysis.

#Histogram of sentiment for all tweets
hist(anthem.score.merge$score,xlab=" ",main="Sentiment of tweets that mention Anthem BCBS",
     border="black",col="skyblue")

#scatter plot of tweet date vs sentiment score 
plot(anthem.score.merge$Date, anthem.score.merge$score, xlab = "Date", ylab = "Sentiment Score", main = "Sentiment of tweets that mention Anthem BCBS by Date")

#taken from https://www.r-bloggers.com/twitter-sentiment-analysis-with-r/ 
#total evaluation: positive / negative / neutral
stat <- anthem.score.merge$score
stat <- mutate(anthem.score.merge, tweet=ifelse(anthem.score.merge$score > 0, 'positive', ifelse(anthem.score.merge$score < 0, 'negative', 'neutral')))
by.tweet <- group_by(stat, tweet, Date)
by.tweet <- dplyr::summarise(by.tweet, number=n())

#Sentiment (positive, negative and neutral) over time
ggplot(by.tweet, aes(Date, number)) + geom_line(aes(group=tweet, color=tweet), size=2) +
  geom_point(aes(group=tweet, color=tweet), size=4) +
  theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1))

Purely from the histogram we see that sentiment scores look very normal with the majority of the tweets having a slightly negative (-1) sentiment score. From the scatter plot there does not appear to be much of a trend (keeping in mind that many points are plotted on top of each other here).

Finally from the graph of sentiment over time we can see that overall sentiment varies by date and seems to have trends over time.

Correlation with Stock Prices

While sentiment on it’s own is a fairly interesting analysis we wanted to further analyze and look for any correlation with stock prices.

First we read a csv file in that was downloaded from Yahoo.com. This file was then cleaned a bit (date formatting) and joined with the file that contained each tweet and the associated sentiment. Finally we eliminated any tweets that had no daily stock price change since this indicated they came from a weekend when no trading occurred.

#Read stock price CSV in 
stock_prices <- read.csv("ANTM Stock Prices.csv")

#Format date so R knows this is a date field
stock_prices$Date <- as.Date(stock_prices$Date, "%m/%d/%y")

#Left join the sentiment analysis with the stock prices 
tweet_stock <- left_join(anthem.score.merge, stock_prices, by = "Date")

#eliminate rows with no daily change 
#eliminates weekend tweets 
weekday_tweet_stock <- subset(tweet_stock,  !is.na(Daily.Change))

Now that we have the daily stock price change with each tweet and the sentiment of those tweets we can perform a simple correlation to check for association.

#Raw plot of sentiment score versus daily change in stock price
plot(jitter(weekday_tweet_stock$score), weekday_tweet_stock$Daily.Change, xlab = "Sentiment Score", ylab = "Daily Change in Stock Price")

#The below was modified from a LinkedIn PPT describing sentiment analysis in R
#see appendix / reference for more information 

#Create indicator fields to flag tweets as positive, negative or neutral based on sentiment score 
weekday_tweet_stock$pos <- as.numeric(weekday_tweet_stock$score >= 1)
weekday_tweet_stock$neg <- as.numeric(weekday_tweet_stock$score <= -1)
weekday_tweet_stock$neu <- as.numeric(weekday_tweet_stock$score == 0)

#Transform file from one row per tweet to one row per day summarizing the total positive, negative and netural tweets per day 
tweet_stock_df <- ddply(weekday_tweet_stock, c('Date', 'Up.Down', 'Daily.Change'), plyr::summarise, pos.count = sum(pos), neg.count = sum(neg), neu.count = sum(neu))

tweet_stock_df$all.count <- tweet_stock_df$pos.count + tweet_stock_df$neg.count + tweet_stock_df$neu.count

#calculate the percent of tweets that were positive on each day 
tweet_stock_df$percent.pos <- round((tweet_stock_df$pos.count / tweet_stock_df$all.count) * 100)

#Simple correlation
cor(tweet_stock_df$percent.pos, tweet_stock_df$Daily.Change, use = "complete")

## [1] -0.110713

#Linear model
lm_model <- lm(tweet_stock_df$Daily.Change ~ tweet_stock_df$percent.pos)
summary(lm_model)

## 
## Call:
## lm(formula = tweet_stock_df$Daily.Change ~ tweet_stock_df$percent.pos)
## 
## Residuals:
##       1       2       3       4       5       6 
##  0.6171 -1.2238 -0.7400  1.4003  1.3077 -1.3614 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 1.10660    2.85609   0.387    0.718
## tweet_stock_df$percent.pos -0.01677    0.07527  -0.223    0.835
## 
## Residual standard error: 1.41 on 4 degrees of freedom
## Multiple R-squared:  0.01226,    Adjusted R-squared:  -0.2347 
## F-statistic: 0.04964 on 1 and 4 DF,  p-value: 0.8346

#plot of % positive tweets vs daily change in stock price with linear regression line overlaid
plot(tweet_stock_df$percent.pos, tweet_stock_df$Daily.Change, ylab = "Daily Change in Stock Price",  xlab = "Percent of Tweets Positive", main = "% Positive Tweets vs Daily Stock Price Change for ANTM")
abline(lm_model)

From the final plot of % of positive tweets versus daily change in stock price we can see that linear regression line fitted has a negative slope indicating that as the % positive tweets increases the stock price decreases. The linear model explains a very small proportion of the variance in the data (~2%).

Conclusions / Limitations

From the sentiment analysis we saw that the majority of the tweets from this period were negative. This is not terribly surprising since most individuals only tweet when there is an issue. Given that this period is also very busy for open enrollment signups there may be more people than normal experiencing issues.

The correlation analysis had unexpected results - a negative correlation between the number of positive tweets and daily change in stock price. These results must be interpreted with extreme caution though due to the very small number of dates for which tweets were available and the presence of an outlier in the data. It does appear with more data a pattern might be uncovered.

Some limitations:

Twitter’s API only allows for searches to return the previous 7 days worth of tweets. This severely limited the number of days available for analysis. In the future this data could be collected longitudinally and stored until sufficient data was available to redo analysis.
Anthem uses many different social media accounts on Twitter. Some are more official than others. When checked the number was close to 50. Including more of these accounts might be helpful, but could skew the data if the accounts were not often used by consumers.
Blue Cross Blue Shield is a generic term for insurance offered through companies who have an agreement with the blue cross blue shield association. Consumers often use terms like “BCBS”, “Blue Cross Blue Shield” or any number of Anthem’s state level plan names (e.g. “Blue Cross Blue Shield of Georgia”) to talk about Anthem on social media. Unfortunately these terms are too general to include in this analysis because they are not clearly talking about Anthem. * “Anthem” is a very general term (as opposed to Walmart or iPhone) so searching for #Anthem returned many tweets that had nothing to do with Anthem the company. We excluded #Anthem for this purpose.
The correlation analysis here used linear regression. Given that stock prices are essentially time series data this type of analysis might not be the best way to correlate stock prices and sentiment scores. While not sure of the exact method that might be more apprpriate - non-parametric (signed rank) test come to mind as a possibility.
The sentiment analysis performed here was very simple. The very presence of a negative or positive word does not mean the tweet itself is positive or negative (i.e. sarcasm).

Appendix / References

Sentiment Lexicon: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA, Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web.” Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
Mining Twitter for Sentiment slide deck: http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for
Sentiment Function https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/blob/master/R/sentiment.R
Sentiment Analysis on Donald Trump using R and Tableau: https://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/
Sentiment Analysis in R https://www.r-bloggers.com/twitter-sentiment-analysis-with-r/
Word Clouds in R https://www.r-bloggers.com/word-cloud-in-r/
Using the R Twitter package http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/

Data 607 - Final Project

Brandon OHara

12/18/2016