DATA 607 Final Project

Team Members

Santosh Manjrekar, Robert Lauto

library(SnowballC)
library(tm)
library(twitteR)
library(syuzhet)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(wordcloud)
library(knitr)
library(kableExtra)

Project description

Analyze the correlation between Twitter sentiment and stock price

Twitter is a large social platform where users share whatever is on their minds in brief text messages. Most users belong to the general public, but many companies have joined the platform to manage their public relations. Because it lets people air their appreciation or frustration in public, Twitter gives the public real influence over companies. Trending hashtags can be part of a successful marketing campaign or, on the other hand, of a failed campaign that backfired. In this project we try to better understand the trends on Twitter and assess whether there is any relationship between Twitter sentiment and stock price.

Data Extraction

To collect the data we used the twitteR package to request data from the Twitter Search API. We narrowed our search to tweets that contained Microsoft-affiliated Twitter handles. Unfortunately, the Twitter API only lets you search and extract data for the last 10 days. The data was saved in one CSV file per day. Please refer to the data extraction program on GitHub.
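The per-day CSV step could look roughly like the following sketch. It assumes the tweets are already in a data frame with a `created` timestamp column, which is what `twitteR::twListToDF()` produces. The function name `save_tweets_by_day`, the `prefix` argument, and the file-name pattern are hypothetical and are not taken from the authors' extraction program.

```r
# Split a tweet data frame by calendar day and write one CSV file per day.
# Assumes `tweets$created` is POSIXct or text of the form "YYYY-MM-DD HH:MM:SS".
save_tweets_by_day <- function(tweets, out_dir, prefix = "MSFT") {
  day <- substr(as.character(tweets$created), 1, 10)  # keep the date part only
  paths <- character(0)
  for (d in unique(day)) {
    # Hypothetical file-name pattern, e.g. "MSFT-2018-12-11-tweets.csv"
    path <- file.path(out_dir, paste0(prefix, "-", d, "-tweets.csv"))
    write.csv(tweets[day == d, , drop = FALSE], path, row.names = FALSE)
    paths <- c(paths, path)
  }
  paths
}

# Toy input standing in for real API results (invented for illustration)
tweets <- data.frame(
  text = c("first tweet", "second tweet", "third tweet"),
  created = c("2018-12-10 09:15:00", "2018-12-10 18:40:00", "2018-12-11 23:58:10"),
  stringsAsFactors = FALSE
)
files <- save_tweets_by_day(tweets, tempdir())
```

Writing one file per day means a failed or repeated extraction run only affects that day's file, which matters when the API window is short.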

Here is the data extracted for a single tweet, including the emotion and sentiment scores added during processing.

df_msft_twit_data = read.csv("/Users/Rob/Documents/MSDS/DATA 607/Projects/final_project/MSFT- 2018-12-11 -twitt-emotion-senti-data.csv")
head(df_msft_twit_data, 1)

##   X
## 1 6
##                                                                                                 text
## 1 ShiSh ShridharDirector Business Strategy Retail Industrydiscusses what\x92s really working and\x85
##   favorited favoriteCount replyToSN             created truncated
## 1     FALSE             1      <NA> 2018-12-11 23:58:10      TRUE
##   replyToSID           id replyToUID
## 1         NA 1.072642e+18         NA
##                                                            statusSource
## 1 <a href="https://www.hootsuite.com" rel="nofollow">Hootsuite Inc.</a>
##        screenName retweetCount isRetweet retweeted longitude latitude
## 1 ObjectMgmtGroup            2     FALSE     FALSE        NA       NA
##   anger anticipation disgust fear joy sadness surprise trust negative
## 1     0            0       0    0   0       0        0     0        0
##   positive category_senti sent.value
## 1        1       Positive       0.25

Data Processing and Tidying

Retweets were filtered out. We used the syuzhet library for sentiment analysis; it assigns every tweet a score for each of several emotions. The anger through trust columns in the single-tweet record above are a sample of these emotion scores.
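As a rough illustration of this step, the sketch below drops retweets and then scores the remaining tweets. `get_nrc_sentiment()` and `get_sentiment()` are real syuzhet functions, but the report does not say which scoring method produced `sent.value`, so `method = "syuzhet"` and the rule for `category_senti` are assumptions. The toy tweets are invented.

```r
# Toy tweets; `isRetweet` mirrors the column of the same name in the tweet data
tweets <- data.frame(
  text = c("I love the new Surface, great work",
           "This update broke my laptop, terrible",
           "RT: I love the new Surface, great work"),
  isRetweet = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Drop retweets so that each opinion is counted once
originals <- tweets[!tweets$isRetweet, , drop = FALSE]

# Score the text only when syuzhet is installed
if (requireNamespace("syuzhet", quietly = TRUE)) {
  # One column per NRC emotion plus `negative` and `positive`
  emotions <- syuzhet::get_nrc_sentiment(originals$text)
  originals <- cbind(originals, emotions)

  # Assumed scoring method; the report does not state which one was used
  originals$sent.value <- syuzhet::get_sentiment(originals$text, method = "syuzhet")

  # Assumed labelling rule for the category column
  originals$category_senti <- ifelse(originals$sent.value > 0, "Positive",
                              ifelse(originals$sent.value < 0, "Negative", "Neutral"))
}
```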

We grouped the data by day and calculated the average of each emotion score over the tweets gathered for that day, giving a daily sentiment score and a set of daily emotion scores.

Here is a sample:

df_msft_daily_summary = read.csv("/Users/Rob/Documents/MSDS/DATA 607/Projects/final_project/MSFT-twitt-final-summary.csv")

kable(head(df_msft_daily_summary)) %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")

X	data_date	net_sent	net_anger	net_anticipation	net_disgust	net_fear	net_joy	net_sadness	net_surprise	net_trust	net_negative	net_positive
1	2018-12-03	0.6941333	0.0725236	0.4510613	0.0583726	0.0931604	0.3873821	0.0790094	0.0778302	0.5047170	0.1650943	0.9669811
1	2018-12-04	0.4753744	0.0840266	0.3356905	0.0640599	0.1114809	0.2583195	0.0952579	0.0994176	0.3677205	0.1880200	0.7113145
1	2018-12-06	0.4529504	0.1023499	0.3404700	0.0788512	0.1321149	0.2443864	0.1033943	0.1091384	0.3650131	0.2036554	0.6610966
1	2018-12-07	0.4471648	0.0840560	0.3288859	0.0593729	0.1160774	0.2308205	0.0967312	0.1054036	0.3689126	0.2008005	0.6777852
1	2018-12-10	0.3467890	0.1201001	0.2468724	0.0783987	0.1134279	0.1801501	0.1009174	0.0800667	0.2952460	0.2410342	0.5663053
1	2018-12-11	0.4900310	0.0916409	0.3498452	0.0767802	0.1281734	0.2681115	0.0972136	0.0842105	0.3894737	0.2198142	0.7597523
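The daily averaging described above can be sketched with base R's `aggregate()`. The toy values are invented, and mapping the per-tweet columns to `net_*` names is an assumption about how the summary file was built.

```r
# Toy per-tweet scores (invented values)
tweets <- data.frame(
  created = c("2018-12-10 09:15:00", "2018-12-10 18:40:00", "2018-12-11 23:58:10"),
  sent.value = c(0.5, -0.1, 0.25),
  joy = c(1, 0, 0),
  anger = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Derive the calendar day from the timestamp
tweets$data_date <- substr(tweets$created, 1, 10)

# Mean of every score column within each day
daily <- aggregate(cbind(sent.value, joy, anger) ~ data_date, data = tweets, FUN = mean)

# Assumed naming convention, following the summary table
names(daily) <- c("data_date", "net_sent", "net_joy", "net_anger")
daily
```

Each output row is one day, so the number of rows available for the later regression equals the number of days with tweets, not the number of tweets.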

Data Analysis

Here is the graph of the positive and negative sentiment scores and the emotion scores for each day. Positive sentiment and the positive emotions follow the same pattern; similarly, negative sentiment and the negative emotions follow the same pattern.

ggplot(df_msft_daily_summary, aes(x = data_date)) + 
  geom_line(aes(y = net_sent, group=1), colour="red") +
  geom_line(aes(y = net_anger, group=2), colour="yellow") +
  geom_line(aes(y = net_anticipation, group=3), colour="green") +
  geom_line(aes(y = net_disgust, group=4), colour="chocolate") +
  geom_line(aes(y = net_fear, group=5), colour="black") +
  geom_line(aes(y = net_joy, group=6), colour="slateblue") +
  geom_line(aes(y = net_sadness, group=7), colour="darkviolet") +
  geom_line(aes(y = net_surprise, group=8), colour="orange") +
  geom_line(aes(y = net_trust, group=9), colour="skyblue") +
  geom_line(aes(y = net_positive, group=10), colour="darkgreen") +  # distinct from net_disgust, which already uses "chocolate"
  geom_line(aes(y = net_negative, group=11), colour="tomato") +
  ylab(label="Score") + 
  xlab("Tweet Date")
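One way to check the visual impression that the positive measures move together, and the negative ones likewise, is a correlation matrix. The sketch below retypes four columns from the six daily rows of the summary table; picking joy and anger as the representative emotions is our choice for illustration. With only six points these correlations are indicative at best.

```r
# Four columns copied from the daily summary table
daily <- data.frame(
  data_date    = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06",
                           "2018-12-07", "2018-12-10", "2018-12-11")),
  net_joy      = c(0.3873821, 0.2583195, 0.2443864, 0.2308205, 0.1801501, 0.2681115),
  net_positive = c(0.9669811, 0.7113145, 0.6610966, 0.6777852, 0.5663053, 0.7597523),
  net_anger    = c(0.0725236, 0.0840266, 0.1023499, 0.0840560, 0.1201001, 0.0916409),
  net_negative = c(0.1650943, 0.1880200, 0.2036554, 0.2008005, 0.2410342, 0.2198142)
)

# Pairwise Pearson correlations between the score columns
cm <- cor(daily[, -1])
round(cm, 2)
```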

Positive sentiment word cloud

# Keep only strongly positive tweets (sentiment score above 3)
df_positive <- df_msft_twit_data[df_msft_twit_data$sent.value > 3, ]
nrow(df_positive)

## [1] 14

# Show word cloud

positive_corp <- Corpus(VectorSource(df_positive$text))

# Lowercase first so that capitalized stop words ("The", "And") are removed too
positive_corp <- positive_corp %>% tm_map(content_transformer(tolower))
positive_corp <- positive_corp %>% tm_map(content_transformer(removePunctuation))
positive_corp <- positive_corp %>% tm_map(content_transformer(removeNumbers))
positive_corp <- tm_map(positive_corp, removeWords, c("the","and","that","this","was","with","for","your"))
positive_corp <- positive_corp %>% tm_map(content_transformer(stemDocument),  language = 'english')
# Stemming reduces words to their stems, so some terms in the cloud look truncated

wordcloud(positive_corp, max.words = 75, random.order = FALSE, random.color = TRUE,colors=palette())
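A word cloud is a picture of term frequencies: the more often a term occurs, the larger it is drawn. The base R sketch below mimics the cleaning steps (lowercasing, punctuation and number removal, stop-word removal) on invented tweets and counts the remaining terms. It leaves out stemming and is an illustration, not the tm pipeline itself.

```r
# Invented example tweets
tweets <- c("Windows update failed again!",
            "The update was great",
            "Update 1809 broke my PC")

# Same stop-word list as in the report
stop_words <- c("the", "and", "that", "this", "was", "with", "for", "your")

clean <- tolower(tweets)                  # lowercase first so "The" matches "the"
clean <- gsub("[[:punct:]]", "", clean)   # strip punctuation
clean <- gsub("[[:digit:]]", "", clean)   # strip numbers

# Split on whitespace, then drop empty strings and stop words
tokens <- unlist(strsplit(clean, "\\s+"))
tokens <- tokens[tokens != "" & !(tokens %in% stop_words)]

# Term frequencies, most frequent first
term_freq <- sort(table(tokens), decreasing = TRUE)
term_freq
```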

Negative sentiment word cloud

It looks like many people are not happy with Windows updates.

# Keep only clearly negative tweets (sentiment score below -1)
df_negative <- df_msft_twit_data[df_msft_twit_data$sent.value < -1, ]
nrow(df_negative)

## [1] 34

# Show word cloud

negative_corp <- Corpus(VectorSource(df_negative$text))

# Lowercase first so that capitalized stop words ("The", "And") are removed too
negative_corp <- negative_corp %>% tm_map(content_transformer(tolower))
negative_corp <- negative_corp %>% tm_map(content_transformer(removePunctuation))
negative_corp <- negative_corp %>% tm_map(content_transformer(removeNumbers))
negative_corp <- tm_map(negative_corp, removeWords, c("the","and","that","this","was","with","for","your"))
negative_corp <- negative_corp %>% tm_map(content_transformer(stemDocument),  language = 'english')
# Stemming reduces words to their stems, so some terms in the cloud look truncated

wordcloud(negative_corp, max.words = 75, random.order = FALSE, random.color = TRUE,colors=palette())

Twitter sentiment and stock price analysis

msft <- read.csv('/Users/Rob/Downloads/MSFT.csv')
msft$Date <- as.Date(msft$Date)

## Warning in strptime(xx, f <- "%Y-%m-%d", tz = "GMT"): unknown timezone
## 'zone/tz/2018g.1.0/zoneinfo/America/New_York'

df_msft_daily_summary$data_date <- as.Date(df_msft_daily_summary$data_date)
stock_n_sent <- merge(df_msft_daily_summary, msft, by.x = 'data_date', by.y = 'Date')
microsoft_model <- lm(Close ~ net_sent, stock_n_sent)

# Scatter plot of a fitted lm with its regression line; the title reports key fit statistics
ggplotRegression <- function (fit) {

  require(ggplot2)

  # fit$model holds the response in column 1 and the predictor in column 2
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =", signif(fit$coef[[1]], 5),
                       " Slope =", signif(fit$coef[[2]], 5),
                       " P =", signif(summary(fit)$coef[2, 4], 5)))
}
cor(stock_n_sent$Close, stock_n_sent$net_sent)

## [1] 0.7378167

ggplotRegression(microsoft_model)

summary(microsoft_model)

## 
## Call:
## lm(formula = Close ~ net_sent, data = stock_n_sent)
## 
## Residuals:
##        1        2        3        4        5        6 
##  0.43698  0.19057  1.20126 -3.08085  1.21415  0.03789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.107      3.444  29.359 8.01e-06 ***
## net_sent      15.193      6.950   2.186   0.0941 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.777 on 4 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.4305 
## F-statistic: 4.779 on 1 and 4 DF,  p-value: 0.0941
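The correlation and the regression summary report the same relationship in two forms. The sketch below recomputes R-squared, the slope's t value, and its p-value from the reported correlation and the six observations, using the standard identities for simple linear regression. It then adds an approximate 95% confidence interval for the correlation using the Fisher z transformation, which is our addition and not part of the original analysis.

```r
r <- 0.7378167  # correlation between Close and net_sent reported above
n <- 6          # daily observations in the merged data

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

# t statistic for the slope, and its two-sided p-value on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Approximate 95% confidence interval for r via the Fisher z transformation
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(r_squared = r_squared, t = t_stat, p = p_value), 4)
round(ci, 2)
```

The first three values should agree with the `Multiple R-squared`, `t value`, and `Pr(>|t|)` entries in the summary. The interval is very wide and includes zero, which shows how little six observations pin down the true correlation.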

Conclusions

While there appears to be a positive correlation (r ≈ 0.74) between the daily Twitter sentiment score and Microsoft's closing stock price, the regression slope is not statistically significant at the 0.05 level (p = 0.094), and the fit rests on only six daily observations.
A common rule of thumb for linear regression is a minimum sample size of about 30. We were unable to gather more data due to the restrictions of the Twitter Search API. However, the result is promising enough that, with more data, we should be able to determine whether there is a real correlation between Twitter sentiment and MSFT value.
Factors other than tweets also affect stock price and could be included in further analysis. Tweets may also not be the best predictor for stocks, because people who mention Microsoft or another company may be discussing things only tangential to the company itself.
This process was built with the intention of finding the correlation between Twitter sentiment and stock price, but it can also be used to gauge the general public's response, via tweets, to new product launches, product reviews, or similar use cases.