A large number of consumers turn to social media to express concerns and feedback about company performance. It may be possible to link the sentiment of these items to daily changes in stock prices.
Anthem, Inc is the second largest health insurance provider (by membership) in the United States and the largest Blue Cross Blue Shield member organization. For this project I will perform a sentiment analysis on tweets about Anthem, Inc and examine the relationship between sentiment and stock prices.
Data for this project will come from two sources - tweets about Anthem and daily stock prices from Yahoo.com. Tweets will be collected using the twitteR package which uses the Twitter API to search and return tweets. Stock prices will be downloaded as a CSV and loaded into R directly.
Consumers use many different Twitter accounts and hashtags to reach Anthem. The ones used for this analysis are shown below:
@AskAnthem @AnthemInc @ThinkAnthem @CareMoreHealth @Amerigroup @AskBCBSGa @AnthemBusiness
@AskEmpire @empirebcbs #AnthemBCBS #bcbsAnthem
Once tweets have been gathered the above accounts and hashtags a sentiment analysis will be performed to determine the number of positive and negative words contained in each tweet. The resulting score will indicate the overall sentiment of the tweet.
Once sentiment scores are created a simple linear regression analysis will be performed to determine the relative associate between daily sentiment (daily percent of tweets that are positive) with daily change in stock price (opening price - closing price).
require("twitteR")
## Loading required package: twitteR
require("wordcloud")
## Loading required package: wordcloud
## Loading required package: RColorBrewer
require("tm")
## Loading required package: tm
## Loading required package: NLP
require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require("ggplot2")
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
setwd("~/Desktop/IS607/Data-607/Final Project")
The twitteR package allows for direct connection to the Twitter API. Once connected a user can search twitter, return entire user timelines and perform other functions on Twitter. Before using the package a application must be set up in your Twitter user account. Once set up you will receive a set of codes that can be used to connect the package to Twitter.
## [1] "Using direct authentication"
Once connected we used the searchTwitter function to search twitter for tweets containing the accounts or hashtags from the above. One limitation of the Twitter API is that searching only allows about 7 days worth of tweets to be returned.
tweet_list <- searchTwitter("@AskAnthem OR @AnthemInc OR @ThinkAnthem OR
@CareMoreHealth OR @Amerigroup OR @AskBCBSGa OR @AnthemBusiness
OR @AskEmpire OR @empirebcbs OR #AnthemBCBS OR #bcbsAnthem"
, n = 1000 , lang = "en", since = "2016-12-09")
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 886
The returned tweets are initially stored in a list. For this project need them in a data frame and in text format.
#Convert list to dataframe
tweets.df <- twListToDF(tweet_list)
#removing duplicate tweets (retweets) from dataframe
tweets.nodups.df <- distinct(tweets.df, text, .keep_all = TRUE)
#clean up dataframe a bit
tweets.nodups.df$text <- gsub('…', '', tweets.nodups.df$text) #remove ... at end of tweets
tweets.nodups.df <- plyr::rename(tweets.nodups.df, c("created" = "Date")) #rename created to Date
tweets.nodups.df$Date <- as.Date(tweets.nodups.df$Date) #convert from datetime to date format
#create text list with tweets for sentiment analysis
tweets_text <- lapply(tweet_list, function(x) x$getText())
#fix Mac encoding issue with (from stack overflow)
tweets_text <- sapply(tweets_text,function(row) iconv(row, "latin1", "ASCII", sub=""))
#removing duplicate tweets (retweets) from list
tweets_nodups_text <- unique(tweets_text)
Now that we have the data in the formats we need it in for analysis we can do some preliminary exploration. First we will examine a word cloud of the tweets to see the types of words that make up the returned tweets.
#Create tweet corpus
r_stats_text_corpus <- Corpus(VectorSource(tweets_nodups_text))
#Clean up corpus in prepartion for word cloud
r_stats_text_corpus <- tm_map(r_stats_text_corpus,
content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
mc.cores=1) #Encoding corrections for Mac
r_stats_text_corpus <- tm_map(r_stats_text_corpus, content_transformer(tolower), mc.cores=1) #Transform all text to lower case
r_stats_text_corpus <- tm_map(r_stats_text_corpus, removePunctuation, mc.cores=1) #remove all punctuation
r_stats_text_corpus <- tm_map(r_stats_text_corpus, function(x)removeWords(x,stopwords()), mc.cores=1) #remove all stop words
#Create color word cloud
wordcloud(r_stats_text_corpus, min.freq = 10, max.words = 150, colors=brewer.pal(8, "Dark2"))
For our sentiment analysis we will use a function created that uses a published lexicon of positive and negative words. This function will create a score for each tweet. A score of 0 indicates the tweet is neutral. A score of 1 or more indicates the tweet is positive. A score of -1 or less indicates the tweet is negative. The higher (or lower) the number indicates the relative strength of the sentiment (based on the count of words).
#' Code below from github - see appenix/reference below for more information
#' score.sentiment() implements a very simple algorithm to estimate
#' sentiment, assigning a integer score by subtracting the number
#' of occurrences of negative words from that of positive words.
#'
#' @param sentences vector of text to score
#' @param pos.words vector of words of postive sentiment
#' @param neg.words vector of words of negative sentiment
#' @param .progress passed to <code>laply()</code> to control of progress bar.
#' @returnType data.frame
#' @return data.frame of text and corresponding sentiment scores
#' @author Jefrey Breen <jbreen@cambridge.aero>
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
require(plyr)
require(stringr)
# we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
# we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
scores = laply(sentences, function(sentence, pos.words, neg.words) {
# clean up sentences with R's regex-driven global substitute, gsub():
sentence = gsub('[[:punct:]]', '', sentence)
sentence = gsub('[[:cntrl:]]', '', sentence)
sentence = gsub('\\d+', '', sentence)
# and convert to lower case:
sentence = tolower(sentence)
# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}
#The positive and negative words lexicons are stored in a local director
#Please see appendix/reference for more information on origin
hu.liu.pos = scan('positive-words.txt', what = 'character', comment.char = ';')
hu.liu.neg = scan('negative-words.txt', what = 'character', comment.char = ';')
#Here we add some additional words that were discovered from initial review of tweets
pos.words <- c(hu.liu.pos)
neg.words <- c(hu.liu.neg, 'wait', 'waiting', 'hold', 'onhold' , 'on hold', 'asshole', 'cancel','spam', 'spams', 'cancel', 'wtf')
Once the sentiment function and negative and positive lexicons have been created the function can be ran on the text file we created earlier to score each tweet.
#run the sentiment function on the text of the tweets
anthem.scores <- score.sentiment(tweets_nodups_text, pos.words, neg.words, .progress='none')
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:twitteR':
##
## id
## Loading required package: stringr
#merge the results back with the original file
anthem.score.merge <- merge(anthem.scores, tweets.nodups.df, by = 'text')
Next we will explore the results of the sentiment analysis before proceeding with correlation analysis.
#Histogram of sentiment for all tweets
hist(anthem.score.merge$score,xlab=" ",main="Sentiment of tweets that mention Anthem BCBS",
border="black",col="skyblue")
#scatter plot of tweet date vs sentiment score
plot(anthem.score.merge$Date, anthem.score.merge$score, xlab = "Date", ylab = "Sentiment Score", main = "Sentiment of tweets that mention Anthem BCBS by Date")
#taken from https://www.r-bloggers.com/twitter-sentiment-analysis-with-r/
#total evaluation: positive / negative / neutral
stat <- anthem.score.merge$score
stat <- mutate(anthem.score.merge, tweet=ifelse(anthem.score.merge$score > 0, 'positive', ifelse(anthem.score.merge$score < 0, 'negative', 'neutral')))
by.tweet <- group_by(stat, tweet, Date)
by.tweet <- dplyr::summarise(by.tweet, number=n())
#Sentiment (positive, negative and neutral) over time
ggplot(by.tweet, aes(Date, number)) + geom_line(aes(group=tweet, color=tweet), size=2) +
geom_point(aes(group=tweet, color=tweet), size=4) +
theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1))
Purely from the histogram we see that sentiment scores look very normal with the majority of the tweets having a slightly negative (-1) sentiment score. From the scatter plot there does not appear to be much of a trend (keeping in mind that many points are plotted on top of each other here).
Finally from the graph of sentiment over time we can see that overall sentiment varies by date and seems to have trends over time.
While sentiment on it’s own is a fairly interesting analysis we wanted to further analyze and look for any correlation with stock prices.
First we read a csv file in that was downloaded from Yahoo.com. This file was then cleaned a bit (date formatting) and joined with the file that contained each tweet and the associated sentiment. Finally we eliminated any tweets that had no daily stock price change since this indicated they came from a weekend when no trading occurred.
#Read stock price CSV in
stock_prices <- read.csv("ANTM Stock Prices.csv")
#Format date so R knows this is a date field
stock_prices$Date <- as.Date(stock_prices$Date, "%m/%d/%y")
#Left join the sentiment analysis with the stock prices
tweet_stock <- left_join(anthem.score.merge, stock_prices, by = "Date")
#eliminate rows with no daily change
#eliminates weekend tweets
weekday_tweet_stock <- subset(tweet_stock, !is.na(Daily.Change))
Now that we have the daily stock price change with each tweet and the sentiment of those tweets we can perform a simple correlation to check for association.
#Raw plot of sentiment score versus daily change in stock price
plot(jitter(weekday_tweet_stock$score), weekday_tweet_stock$Daily.Change, xlab = "Sentiment Score", ylab = "Daily Change in Stock Price")
#The below was modified from a LinkedIn PPT describing sentiment analysis in R
#see appendix / reference for more information
#Create indicator fields to flag tweets as positive, negative or neutral based on sentiment score
weekday_tweet_stock$pos <- as.numeric(weekday_tweet_stock$score >= 1)
weekday_tweet_stock$neg <- as.numeric(weekday_tweet_stock$score <= -1)
weekday_tweet_stock$neu <- as.numeric(weekday_tweet_stock$score == 0)
#Transform file from one row per tweet to one row per day summarizing the total positive, negative and netural tweets per day
tweet_stock_df <- ddply(weekday_tweet_stock, c('Date', 'Up.Down', 'Daily.Change'), plyr::summarise, pos.count = sum(pos), neg.count = sum(neg), neu.count = sum(neu))
tweet_stock_df$all.count <- tweet_stock_df$pos.count + tweet_stock_df$neg.count + tweet_stock_df$neu.count
#calculate the percent of tweets that were positive on each day
tweet_stock_df$percent.pos <- round((tweet_stock_df$pos.count / tweet_stock_df$all.count) * 100)
#Simple correlation
cor(tweet_stock_df$percent.pos, tweet_stock_df$Daily.Change, use = "complete")
## [1] -0.110713
#Linear model
lm_model <- lm(tweet_stock_df$Daily.Change ~ tweet_stock_df$percent.pos)
summary(lm_model)
##
## Call:
## lm(formula = tweet_stock_df$Daily.Change ~ tweet_stock_df$percent.pos)
##
## Residuals:
## 1 2 3 4 5 6
## 0.6171 -1.2238 -0.7400 1.4003 1.3077 -1.3614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.10660 2.85609 0.387 0.718
## tweet_stock_df$percent.pos -0.01677 0.07527 -0.223 0.835
##
## Residual standard error: 1.41 on 4 degrees of freedom
## Multiple R-squared: 0.01226, Adjusted R-squared: -0.2347
## F-statistic: 0.04964 on 1 and 4 DF, p-value: 0.8346
#plot of % positive tweets vs daily change in stock price with linear regression line overlaid
plot(tweet_stock_df$percent.pos, tweet_stock_df$Daily.Change, ylab = "Daily Change in Stock Price", xlab = "Percent of Tweets Positive", main = "% Positive Tweets vs Daily Stock Price Change for ANTM")
abline(lm_model)
From the final plot of % of positive tweets versus daily change in stock price we can see that linear regression line fitted has a negative slope indicating that as the % positive tweets increases the stock price decreases. The linear model explains a very small proportion of the variance in the data (~2%).
From the sentiment analysis we saw that the majority of the tweets from this period were negative. This is not terribly surprising since most individuals only tweet when there is an issue. Given that this period is also very busy for open enrollment signups there may be more people than normal experiencing issues.
The correlation analysis had unexpected results - a negative correlation between the number of positive tweets and daily change in stock price. These results must be interpreted with extreme caution though due to the very small number of dates for which tweets were available and the presence of an outlier in the data. It does appear with more data a pattern might be uncovered.
Some limitations: