Santosh Manjrekar, Robert Lauto
library(SnowballC)   # word stemming, used by tm's stemDocument
library(tm)          # text corpus cleaning
library(twitteR)     # Twitter Search API client
library(syuzhet)     # sentiment and emotion scoring
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(wordcloud)
library(knitr)
library(kableExtra)
Twitter is a large social platform where users share whatever is on their mind in brief text messages. Most of its users are members of the general public, but many companies have also joined the platform to manage their public relations. Because users can voice their appreciation of or frustration with a company in public, Twitter gives the public real influence over companies. Trending hashtags can signal a successful marketing campaign or, conversely, one that backfired. In this project we try to better understand trends on Twitter and assess whether there is any relationship between Twitter sentiment and stock price.
To collect the data we used the twitteR package to query the Twitter Search API. We narrowed our search to tweets that contained Microsoft-affiliated Twitter handles. Unfortunately, the Twitter API only lets you search and extract tweets from about the last 10 days. We saved each day's tweets to its own CSV file. Please refer to the data extraction program on GitHub.
Here is the data extracted for one tweet.
df_msft_twit_data = read.csv("/Users/Rob/Documents/MSDS/DATA 607/Projects/final_project/MSFT- 2018-12-11 -twitt-emotion-senti-data.csv")
head(df_msft_twit_data, 1)
## X
## 1 6
## text
## 1 ShiSh ShridharDirector Business Strategy Retail Industrydiscusses what\x92s really working and\x85
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 1 <NA> 2018-12-11 23:58:10 TRUE
## replyToSID id replyToUID
## 1 NA 1.072642e+18 NA
## statusSource
## 1 <a href="https://www.hootsuite.com" rel="nofollow">Hootsuite Inc.</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 ObjectMgmtGroup 2 FALSE FALSE NA NA
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 0 0 0 0 0 0 0 0
## positive category_senti sent.value
## 1 1 Positive 0.25
Retweets were filtered out. We used the syuzhet package for sentiment analysis. It assigns each tweet a score for eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) as well as negative and positive scores; the sample row above shows these columns.
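The scoring step can be sketched roughly as follows. This is a minimal illustration, not our extraction program: the example tweets are made up, and computing `sent.value` with `get_sentiment(method = "syuzhet")` is an assumption about how that column was produced.

```r
library(syuzhet)

# Hypothetical tweets standing in for the collected data
tweets <- data.frame(
  text = c("I love the new Surface, great work",
           "RT: I love the new Surface, great work",
           "The latest update is terrible and broke my laptop"),
  isRetweet = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Drop retweets so repeated text is not counted twice
originals <- tweets[!tweets$isRetweet, ]

# NRC lexicon counts: eight emotions plus negative and positive
emotions <- get_nrc_sentiment(originals$text)

# Overall sentiment score per tweet (assumed method, see note above)
originals$sent.value <- get_sentiment(originals$text, method = "syuzhet")

scored <- cbind(originals, emotions)
names(scored)
```

Each row of `scored` then carries the same emotion columns as the sample row shown earlier.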
We grouped the tweets by day and averaged each emotion and sentiment score over the tweets gathered for that day, which gives one daily sentiment score and one daily score per emotion.
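As a rough sketch of this aggregation, assuming the `net_*` columns are plain daily means of the per-tweet scores (the toy values and the reduced set of columns are illustrative only):

```r
library(dplyr)

# Hypothetical per-tweet scores; the real files have one column per emotion
tweets <- data.frame(
  created    = c("2018-12-10 09:15:00", "2018-12-10 18:40:00", "2018-12-11 23:58:10"),
  sent.value = c(0.5, -0.25, 0.25),
  anger      = c(0, 1, 0),
  joy        = c(1, 0, 0),
  stringsAsFactors = FALSE
)

daily <- tweets %>%
  # Keep only the calendar date from the timestamp
  mutate(data_date = as.Date(substr(created, 1, 10))) %>%
  group_by(data_date) %>%
  # Average every score over the tweets of that day
  summarise(
    net_sent  = mean(sent.value),
    net_anger = mean(anger),
    net_joy   = mean(joy)
  )

daily
```

The same `summarise()` call extends to the remaining emotion columns.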
Here is a sample of the daily summary:
df_msft_daily_summary = read.csv("/Users/Rob/Documents/MSDS/DATA 607/Projects/final_project/MSFT-twitt-final-summary.csv")
kable(head(df_msft_daily_summary)) %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")
| X | data_date | net_sent | net_anger | net_anticipation | net_disgust | net_fear | net_joy | net_sadness | net_surprise | net_trust | net_negative | net_positive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2018-12-03 | 0.6941333 | 0.0725236 | 0.4510613 | 0.0583726 | 0.0931604 | 0.3873821 | 0.0790094 | 0.0778302 | 0.5047170 | 0.1650943 | 0.9669811 |
| 1 | 2018-12-04 | 0.4753744 | 0.0840266 | 0.3356905 | 0.0640599 | 0.1114809 | 0.2583195 | 0.0952579 | 0.0994176 | 0.3677205 | 0.1880200 | 0.7113145 |
| 1 | 2018-12-06 | 0.4529504 | 0.1023499 | 0.3404700 | 0.0788512 | 0.1321149 | 0.2443864 | 0.1033943 | 0.1091384 | 0.3650131 | 0.2036554 | 0.6610966 |
| 1 | 2018-12-07 | 0.4471648 | 0.0840560 | 0.3288859 | 0.0593729 | 0.1160774 | 0.2308205 | 0.0967312 | 0.1054036 | 0.3689126 | 0.2008005 | 0.6777852 |
| 1 | 2018-12-10 | 0.3467890 | 0.1201001 | 0.2468724 | 0.0783987 | 0.1134279 | 0.1801501 | 0.1009174 | 0.0800667 | 0.2952460 | 0.2410342 | 0.5663053 |
| 1 | 2018-12-11 | 0.4900310 | 0.0916409 | 0.3498452 | 0.0767802 | 0.1281734 | 0.2681115 | 0.0972136 | 0.0842105 | 0.3894737 | 0.2198142 | 0.7597523 |
Here is the graph of the positive and negative sentiment and emotion scores for each day. Positive sentiment and the positive emotions follow the same pattern; similarly, negative sentiment and the negative emotions follow the same pattern.
# One line per daily score; data_date is not yet a Date here, so each
# layer sets a constant group to connect the points across days
ggplot(df_msft_daily_summary, aes(x = data_date)) +
  geom_line(aes(y = net_sent, group = 1), colour = "red") +
  geom_line(aes(y = net_anger, group = 2), colour = "yellow") +
  geom_line(aes(y = net_anticipation, group = 3), colour = "green") +
  geom_line(aes(y = net_disgust, group = 4), colour = "chocolate") +
  geom_line(aes(y = net_fear, group = 5), colour = "black") +
  geom_line(aes(y = net_joy, group = 6), colour = "slateblue") +
  geom_line(aes(y = net_sadness, group = 7), colour = "darkviolet") +
  geom_line(aes(y = net_surprise, group = 8), colour = "orange") +
  geom_line(aes(y = net_trust, group = 9), colour = "skyblue") +
  # darkgreen keeps net_positive distinguishable from net_disgust
  geom_line(aes(y = net_positive, group = 10), colour = "darkgreen") +
  geom_line(aes(y = net_negative, group = 11), colour = "tomato") +
  ylab(label = "Score") +
  xlab("Tweet Date")
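Because the colours are set by hand, the plot above has no legend. One possible alternative, sketched here on a small made-up data frame (values rounded from the summary table, only three of the score columns), is to reshape the summary to long format so that ggplot2 builds the legend itself:

```r
library(ggplot2)
library(tidyr)

# Illustrative subset of the daily summary (rounded values)
daily <- data.frame(
  data_date    = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06")),
  net_sent     = c(0.69, 0.48, 0.45),
  net_positive = c(0.97, 0.71, 0.66),
  net_negative = c(0.17, 0.19, 0.20)
)

# One row per (day, measure) pair instead of one column per measure
daily_long <- pivot_longer(daily, cols = -data_date,
                           names_to = "measure", values_to = "score")

# Mapping colour to the measure name yields one line and one legend entry each
p <- ggplot(daily_long, aes(x = data_date, y = score, colour = measure)) +
  geom_line() +
  labs(x = "Tweet Date", y = "Score", colour = "Measure")
p
```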
df_positive <- df_msft_twit_data[df_msft_twit_data$sent.value > 3, ]
nrow(df_positive)
## [1] 14
# Show word cloud of the most positive tweets
positive_corp <- Corpus(VectorSource(df_positive$text))
# Lower-case first so the stop-word list also matches capitalised words
positive_corp <- positive_corp %>% tm_map(content_transformer(tolower))
positive_corp <- positive_corp %>% tm_map(content_transformer(removePunctuation))
positive_corp <- positive_corp %>% tm_map(content_transformer(removeNumbers))
positive_corp <- tm_map(positive_corp, removeWords, c("the","and","that","this","was","with","for","your"))
# Stemming truncates words to their stems (e.g. "updates" becomes "updat")
positive_corp <- positive_corp %>% tm_map(content_transformer(stemDocument), language = 'english')
wordcloud(positive_corp, max.words = 75, random.order = FALSE, random.color = TRUE, colors = palette())
Turning to the most negative tweets, it looks like many people are not happy with Windows updates.
df_negative <- df_msft_twit_data[df_msft_twit_data$sent.value < -1, ]
nrow(df_negative)
## [1] 34
# Show word cloud of the most negative tweets
negative_corp <- Corpus(VectorSource(df_negative$text))
# Lower-case first so the stop-word list also matches capitalised words
negative_corp <- negative_corp %>% tm_map(content_transformer(tolower))
negative_corp <- negative_corp %>% tm_map(content_transformer(removePunctuation))
negative_corp <- negative_corp %>% tm_map(content_transformer(removeNumbers))
negative_corp <- tm_map(negative_corp, removeWords, c("the","and","that","this","was","with","for","your"))
# Stemming truncates words to their stems (e.g. "updates" becomes "updat")
negative_corp <- negative_corp %>% tm_map(content_transformer(stemDocument), language = 'english')
wordcloud(negative_corp, max.words = 75, random.order = FALSE, random.color = TRUE, colors = palette())
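The positive and negative word clouds share the same cleaning steps, so they could be factored into one helper. The following is only a sketch: `clean_corpus` is a hypothetical name, it uses `VCorpus` rather than `Corpus`, and it adds a whitespace clean-up step that the code above does not have.

```r
library(tm)
library(SnowballC)

# Hypothetical helper: build a cleaned corpus from a character vector
clean_corpus <- function(texts,
                         drop_words = c("the", "and", "that", "this",
                                        "was", "with", "for", "your")) {
  corp <- VCorpus(VectorSource(texts))
  corp <- tm_map(corp, content_transformer(tolower))  # lower-case first
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, drop_words)       # now matches "The" too
  corp <- tm_map(corp, stemDocument, language = "english")
  tm_map(corp, stripWhitespace)                       # tidy leftover gaps
}

# Made-up example texts
demo <- clean_corpus(c("The Update was 2 slow!!", "Loving THIS release"))
cleaned <- unname(sapply(demo, as.character))
cleaned
```

The result can be passed to `wordcloud()` in the same way as `positive_corp` and `negative_corp`.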
msft <- read.csv('/Users/Rob/Downloads/MSFT.csv')
msft$Date <- as.Date(msft$Date)
## Warning in strptime(xx, f <- "%Y-%m-%d", tz = "GMT"): unknown timezone
## 'zone/tz/2018g.1.0/zoneinfo/America/New_York'
df_msft_daily_summary$data_date <- as.Date(df_msft_daily_summary$data_date)
stock_n_sent <- merge(df_msft_daily_summary, msft, by.x = 'data_date', by.y = 'Date')
microsoft_model <- lm(Close ~ net_sent, stock_n_sent)
ggplotRegression <- function(fit) {
  require(ggplot2)
  # fit$model holds the response in column 1 and the predictor in column 2
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =", signif(fit$coef[[1]], 5),
                       " Slope =", signif(fit$coef[[2]], 5),
                       " P =", signif(summary(fit)$coef[2, 4], 5)))
}
cor(stock_n_sent$Close, stock_n_sent$net_sent)
## [1] 0.7378167
ggplotRegression(microsoft_model)
summary(microsoft_model)
##
## Call:
## lm(formula = Close ~ net_sent, data = stock_n_sent)
##
## Residuals:
## 1 2 3 4 5 6
## 0.43698 0.19057 1.20126 -3.08085 1.21415 0.03789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.107 3.444 29.359 8.01e-06 ***
## net_sent 15.193 6.950 2.186 0.0941 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.777 on 4 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.4305
## F-statistic: 4.779 on 1 and 4 DF, p-value: 0.0941
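There appears to be a positive correlation between Twitter sentiment score and Microsoft's closing stock price (r = 0.74). However, the slope is not statistically significant at the 5% level (p = 0.094), and the fit rests on only six daily observations.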
A common rule of thumb calls for a sample of at least 30 observations to run a linear regression. We were unable to gather more data because of the restrictions on the Twitter Search API. However, the result is promising: with more data we should be able to determine whether there is a correlation between Twitter sentiment and MSFT value.
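To make the effect of the small sample concrete, the t statistic and p-value of the regression can be recovered from the correlation alone, and a confidence interval for the correlation can be approximated with the Fisher z transformation. This is a back-of-the-envelope sketch using the values reported in the output above; the variable names are ours.

```r
# Values taken from the output above
r <- 0.7378167   # cor(Close, net_sent)
n <- 6           # days in the merged data

# For simple linear regression the slope's t statistic depends only on r and n
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Approximate 95% confidence interval for r (Fisher z transformation)
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 3)
```

The t statistic and p-value agree with the `summary(microsoft_model)` output, and with only six points the interval for the correlation is wide enough to include zero.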
Factors other than tweets also affect stock price and could be included in further analysis. Tweets may also not be the best predictor of stock price, because people can mention Microsoft, or any company, while discussing things that are only tangential to the company itself.
This process was built with the intention of finding the correlation between Twitter sentiment and stock price, but it could also be used to gauge the general public's response, via tweets, to new product launches, product reviews, or similar use cases.