Financial markets contain a plethora of statistical patterns. Their behavior resembles that of patterns in natural phenomena: both are driven by unknown and unstable variables, which leads to high unpredictability and volatility and makes it almost impossible to forecast future behavior.
As Burton Malkiel argues in his 1973 book, “A Random Walk Down Wall Street,” stock prices essentially follow a random walk, so past price movements cannot reliably be used to predict future ones.
Our forecasting methodology is to analyze and categorize the Twitter sentiment regarding the Daimler company (positive, neutral, negative) and then to check whether there is a correlation between that sentiment and the share price.
The benefit of this methodology is that we receive the tweets in real time and can make the corresponding corrections to our portfolio.
If there is a correlation between the daily price change and the sentiment score of our target share, we could predict the next day’s movement.
Of course, the models presented here can be applied to any kind of opinion measurement, not only to share forecasting. Marketing, for example, is a domain with many applications of this kind of analysis. More than one billion users send and read “tweets,” which makes Twitter a rich source of opinion data.
Positive or negative sentiment about a target company can be identified using Natural Language Processing, through word dictionaries (lexicons) and an n-gram model.
The Daimler-related tweets are extracted through the Twitter API and then filtered by performing sentiment analysis, which classifies each tweet as positive, negative, or neutral.
Because tweets are a good example of unstructured data and contain many symbols, hashtags, URLs, images, videos, etc., we apply several cleansing steps, such as corpus formation, stop-word removal, and conversion of encodings (emoticons etc.) into ASCII.
After data cleansing, tweets are labelled as “positive” or “negative” based on their sentiment score. The labelling is performed on the basis of an n-gram model.
We applied two types of classifiers:
- Naive Bayes (Bernoulli)
- Support Vector Machine
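The classifier training code itself is not shown in this walkthrough. Below is a minimal sketch of how the two classifiers could be trained on a labelled document-term matrix using the e1071 package; the package choice and the object names dtm_train and labels are assumptions, not the original implementation.
# Sketch only: train the two classifiers on a labelled document-term matrix
require(e1071)
# Bernoulli features: convert term counts to presence/absence factors
dtm_binary <- as.data.frame(lapply(as.data.frame(as.matrix(dtm_train) > 0), factor))
nb_model <- naiveBayes(dtm_binary, labels)                        # Bernoulli-style Naive Bayes
svm_model <- svm(as.matrix(dtm_train), labels, kernel = "linear") # linear Support Vector Machine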
Load required libraries
library(utils)
library(GGally)
library(devtools)
library(twitteR)
library(tm)
library(wordcloud2)
library(plyr)
library(stringr)
library(data.table)
library(dplyr)
library(ggthemes)
library(plotly)
library(plotrix)
library(httr)
library(ROAuth)
library(NLP)
library(base64enc)
library(wordcloud)
library(ggplot2)
library(lattice)
library(rjson)
library(bit64)
From the twitteR package we use the setup_twitter_oauth function. You first need to register an app on the Twitter developer portal and then create the API keys. The keys are hidden here.
require(ROAuth)
require(twitteR)
mykey = "XXXXXXXXXX"
mysecret = "XXXXXXXXXX"
mytoken = "XXXXXXXXXX"
mysecrettoken = "XXXXXXXXXX"
# the arguments must be supplied in this order
setup_twitter_oauth(mykey, mysecret, mytoken, mysecrettoken)
# from the twitteR package we use the searchTwitter function to extract
# tweets. The free Twitter API only returns tweets from roughly the last 10 days.
# resultType can also be set to 'popular'. With AND / OR you can add
# more search keywords
tweets_daimler = searchTwitter("Daimler", n = 10000, resultType = "mixed", lang = "en",
retryOnRateLimit = 100)
# we convert the tweets_daimler from list to a data frame
(n.tweet <- length(tweets_daimler))
tweets_daimler.df <- twListToDF(tweets_daimler)
head(tweets_daimler)
require(dplyr)
tweets_daimler.nodups.df <- distinct(tweets_daimler.df, text, .keep_all = TRUE)
tweets_daimler.nodups.df$text <- gsub("… ", "", tweets_daimler.nodups.df$text)
# We rename the 'created' column to 'Date'
tweets_daimler.nodups.df <- plyr::rename(tweets_daimler.nodups.df, c(created = "Date"))
# We transform it from datetime to date format
tweets_daimler.nodups.df$Date <- as.Date(tweets_daimler.nodups.df$Date)
tweets_text <- lapply(tweets_daimler, function(x) x$getText())
# We remove duplicate texts (mostly retweets)
tweets_unique <- unique(tweets_text)
# We remove the emoticons
tweets_text <- sapply(tweets_text, function(row) iconv(row, "latin1", "ASCII",
sub = "byte"))
require(stringr)
functionalText = str_replace_all(tweets_text, "[^[:graph:]]", " ")
# We create the corpus with the Daimler tweets to use in the
# sentiment analysis
require(tm)
tweets_collections <- Corpus(VectorSource(tweets_unique))
# We split the sentences into distinct words to use in the word-cloud plot.
# Ignore the warnings produced by the following scripts
functionalText = str_replace_all(tweets_collections, "[^[:graph:]]", " ")
tweets_collections <- tm_map(tweets_collections, content_transformer(function(x) iconv(x,
to = "latin1", "ASCII", sub = "")))
# We change all words from capital to lower case
tweets_collections <- tm_map(tweets_collections, content_transformer(tolower))
# We delete all punctuations
tweets_collections <- tm_map(tweets_collections, removePunctuation)
# From tm package we use the below functions to return various kinds of
# stopwords with support for different languages.
tweets_collections <- tm_map(tweets_collections, function(x) removeWords(x,
stopwords()))
tweets_collections <- tm_map(tweets_collections, removeWords, stopwords("english"))
tweets_collections <- tm_map(tweets_collections, removeNumbers)
tweets_collections <- tm_map(tweets_collections, stripWhitespace)
# From tm package we use the below functions to construct or coerce to a
# term-document matrix or a document-term matrix
term_matrix <- TermDocumentMatrix(tweets_collections, control = list(wordLengths = c(1,
Inf)))
# view the terms
term_matrix
# From tm package we use findFreqTerms to find frequent terms in a
# document-term or term-document matrix
(frequency.terms <- findFreqTerms(term_matrix, lowfreq = 20))
# Transform it to matrix
term_matrix <- as.matrix(term_matrix)
# We compute the frequency of the distinct words from the tweets and we sort
# it by frequency
distinctWordsfrequency <- sort(rowSums(term_matrix), decreasing = T)
# Transform it to df
distinctWordsDF <- data.frame(names(distinctWordsfrequency), distinctWordsfrequency)
colnames(distinctWordsDF) <- c("word", "frequency")
Display the 5 most frequent words
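The code for this step is not shown in the original; a minimal sketch using the data frame built above:
# display the 5 most frequent words
head(distinctWordsDF, 5)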
We plot an interactive, colorful word cloud with the most frequent words
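A sketch of the word-cloud call, using the wordcloud2 package loaded above; the styling options are illustrative assumptions:
# interactive word cloud of the most frequent words
wordcloud2(distinctWordsDF, color = "random-light", backgroundColor = "black")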
I used the well-known lexicon of positive and negative opinion words created by: Liu, Bing, Minqing Hu, and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web.” In Proceedings of the 14th International Conference on World Wide Web (WWW-2005), pp. 342-351. ACM, May 10-14, 2005. Thanks to Liu and Hu we can add more than 6,500 positive and negative words, phrases, and idioms.
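The loading of the lexicon files is not shown in the original; a minimal sketch, assuming the lexicon has been downloaded as its usual two plain-text files (the file paths are assumptions):
# read the Hu & Liu opinion lexicon; lines starting with ";" are header comments
hu.liu.positive <- scan("opinion-lexicon-English/positive-words.txt", what = "character", comment.char = ";")
hu.liu.negative <- scan("opinion-lexicon-English/negative-words.txt", what = "character", comment.char = ";")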
We also add some extra words that we observed while reviewing the tweets.
We merge the above two positive lexicons
pos.words <- c(hu.liu.positive, PositiveWordsResearch)
# We add the extra words we noticed
pos.words <- c(pos.words, "thanx", "awesome", "fantastic", "super", "prima",
"toll", "cool", "geil", "profit", "profits", "earnings", "congrats", "prizes",
"prize", "thanks", "thnx", "Grt", "gr8", "plz", "trending", "recovering",
"brainstorm", "leader")
neg.words <- c(hu.liu.negative, "avoid", "lose", "loses", "scandal", "dieselgate",
"sucks", "awful", "disgusting", "negative", "wait", "waiting", "hold", "onhold",
"on hold", "cancel", "spam", "spams", "cancel", "wth", "Fight", "fighting",
"wtf", "arrest", "no", "not")
We create the score.sentiment function, which we will later run on our Daimler tweets. It transforms the vector of sentences into a simple array of scores using the plyr package. Each sentence is first cleaned with gsub() and converted to lower case. The function was inspired by https://medium.com/@rohitnair_94843/analysis-of-twitter-data-using-r-part-3-sentiment-analysis-53d0e5359cb8
require(plyr)
require(stringr)
score.sentiment = function(sentences, pos.words, neg.words, .progress = "none") {
scores = laply(sentences, function(sentence, pos.words, neg.words) {
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
# With the stringr package we split the sentence into words
word.list = str_split(sentence, "\\s+")
# We unlist the vector
words = unlist(word.list)
# We compare the words from daimler tweets, with the above positive and
# negative words of the dictionaries
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# The match function returned the position of the matched word otherwise NA
# Remove NAs
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
# sum() treats TRUE/FALSE as 1/0, so the score is the count of positive minus negative matches
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress = .progress)
scores.df = data.frame(score = scores, text = sentences)
return(scores.df)
}
We apply the above sentiment function on our tweets
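The assignment itself is not shown in the original; a sketch of the missing step (the object name daimler.scores is taken from the code further below):
# score every de-duplicated tweet against the positive and negative word lists
daimler.scores <- score.sentiment(unlist(tweets_unique), pos.words, neg.words, .progress = "text")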
To create the sentiment polarities we also apply “Sentiment Classification using Distant Supervision,” a method first presented by Go, Bhayani, and Huang in their 2009 Stanford research paper (https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf). They present the results of machine-learning algorithms for classifying the sentiment of Twitter messages using distant supervision. Their training data consists of Twitter messages with emoticons, which are used as noisy labels; this type of training data is abundantly available and can be obtained through automated means. They show that machine-learning algorithms (Naive Bayes, Maximum Entropy, and SVM) achieve accuracy above 80% when trained with emoticon data.
require(sentiment)
sentiments <- sentiment(tweets_unique)
table(sentiments$polarity)
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
require(data.table)
sentiments$date <- as.IDate(tweets_daimler.nodups.df$Date)
result <- aggregate(score ~ date, data = sentiments, mean)
We create an interactive plot of the mean sentiment score per date
require(ggthemes)
require(ggplot2)
require(plotly)
plotresults <- ggplot(result, aes(x = date, y = score)) + xlab("Date") + ylab("Sentiment Scoring Mean") +
ggtitle("Sentiment Scoring Mean by Date") + theme_solarized(light = FALSE) +
geom_path(color = "yellow", size = 1) + geom_point(color = "red", size = 3)
ggplotly(plotresults)
We notice that the mean increased sharply on the 10th of May.
Now that we have calculated the sentiment scores, we display the first six scores:
  score
1     2
2     0
3     0
4    -1
5     0
6     0
Now we create a new data frame, which combines all of the above results with the original Daimler tweets data frame.
tweets_daimler.nodups.df$text = tweets_unique
daimler.score.merge <- merge(daimler.scores, tweets_daimler.nodups.df, by = "text")
We plot an interactive histogram of sentiment for all tweets
plothist <- ggplot(daimler.scores, aes(x = score, fill = "score")) + xlab("Sentiment Score") +
    ylab("Number of Tweets") + ggtitle("Sentiment of tweets that mention Daimler") +
    theme_solarized(light = FALSE) + geom_histogram()
ggplotly(plothist)
We plot an interactive scatter plot of tweet date vs. sentiment score
plotscatter <- ggplot(NULL, aes(tweets_daimler.nodups.df$Date, daimler.scores$score)) +
geom_point(data = tweets_daimler.nodups.df) + geom_point(data = daimler.scores,
color = "red") + xlab("Date") + ylab("Sentiment Scoring") + ggtitle("Sentiment of tweets that mention Daimler by Date") +
theme_solarized(light = FALSE)
ggplotly(plotscatter)
We compute the percentages of the 7 different sentiment categories of our tweets
Perc = daimler.scores$score
# Sentiment Good: the output is TRUE or FALSE for each score
good <- sapply(Perc, function(Perc) Perc <= 3 && Perc >= 1)
# We convert it to actual value
Perc[good]
list_good = Perc[good]
value_good = length(list_good)
# Sentiment Very good
verygood <- sapply(Perc, function(Perc) Perc > 3 && Perc < 6)
# We convert it to actual value
Perc[verygood]
list_verygood = Perc[verygood]
value_verygood = length(list_verygood)
# Sentiment Outstanding
outstanding <- sapply(Perc, function(Perc) Perc >= 6)
# We convert it to actual value
Perc[outstanding]
list_outstanding = Perc[outstanding]
value_outstanding = length(list_outstanding)
# Sentiment Bad (Unsatisfactory): the output is TRUE or FALSE for each score
bad <- sapply(Perc, function(Perc) Perc >= -3 && Perc <= -1)
# We convert it to actual value
Perc[bad]
list_bad = Perc[bad]
value_bad = length(list_bad)
# Sentiment Very Bad (Poor): the output is TRUE or FALSE for each score
verybad <- sapply(Perc, function(Perc) Perc < -3 && Perc > -6)
# We convert it to actual value
Perc[verybad]
list_verybad = Perc[verybad]
value_verybad = length(list_verybad)
# Sentiment extremely bad
extremelybad <- sapply(Perc, function(Perc) Perc <= -6)
# We convert it to actual value
Perc[extremelybad]
list_extremelybad = Perc[extremelybad]
value_extremelybad = length(list_extremelybad)
# Sentiment Neutral
neutral <- sapply(Perc, function(Perc) Perc > -1 && Perc < 1)
list_neutral = Perc[neutral]
value_neutral = length(list_neutral)
slices <- c(value_good, value_extremelybad, value_bad, value_verygood, value_verybad,
value_neutral, value_outstanding)
lbls <- c("Good", "Extremely Bad", "Bad", "Great", "Poor", "Neutral", "Outstanding")
# Check for 0's
slices
# We see that we have 0 extremely bad tweets so we remove it
slices <- c(value_good, value_bad, value_verygood, value_verybad, value_neutral,
value_outstanding)
lbls <- c("Good", "Bad", "Great", "Poor", "Neutral", "Outstanding")
pct <- round(slices/sum(slices) * 100)  # We compute percentages to use in the pie chart
lbls <- paste(lbls, pct) # add percentage to the labels
lbls <- paste(lbls, "%", sep = "") # we add the symbol % to labels
We create a 3D pie chart with the percentages of positive, negative, and neutral sentiment categories
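A sketch of the pie-chart call, using the plotrix package loaded above; the styling options are illustrative:
# 3D pie chart of the sentiment category percentages
pie3D(slices, labels = lbls, explode = 0.1, main = "Sentiment of Tweets Mentioning Daimler")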
We create 3 main sentiment categories : positive - negative - neutral
sentimCategories <- daimler.scores$score
require(plyr)
sentimCategories <- plyr::mutate(daimler.scores, tweet = ifelse(daimler.scores$score >
0, "positive", ifelse(daimler.scores$score < 0, "negative", "neutral")))
# Ignore warnings
require(dplyr)
require(utils)
lapply(paste("package:", names(sessionInfo()$otherPkgs), sep = ""), detach,
character.only = TRUE, unload = TRUE)
byTweet <- dplyr::group_by(sentimCategories, tweet, tweets_daimler.nodups.df$Date)
byTweet <- dplyr::summarise(byTweet, number = n())
We plot the sentiment categories (positive, negative and neutral) per day
require(ggplot2)
require(ggthemes)
ggplot(byTweet, aes(byTweet$`tweets_daimler.nodups.df$Date`, byTweet$number)) +
xlab("Date") + ylab("Number Of Tweets") + ggtitle("Daimler Tweets Sentiments per Date") +
geom_line(aes(group = tweet, color = tweet), size = 2) + geom_point(aes(group = tweet,
color = tweet), size = 4) + theme(text = element_text(size = 18), axis.text.x = element_text(angle = 90,
vjust = 1)) + theme_solarized(light = FALSE)
From the scatter plot we can identify a negative relationship between the percentage of negative tweets and the stock price: as the percentage of negative tweets increases, the value of the stock decreases.
The Pearson correlation coefficient confirms this: there is a moderate negative correlation (-0.473), meaning that as the percentage of negative tweets increases, the value of the stock tends to decrease.
In the kernel density plot we observe a strong negative skew. Density plots for every continuous numeric variable help us identify skewness, kurtosis, and other distributional information. Q-Q normal plots can also be useful for diagnosing departures from normality by comparing the data quantiles to those of a standard normal distribution; substantial deviations from linearity indicate departures from normality. Quantiles are a regular spacing of points throughout an ordered data set.
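The construction of tweets_plus_stock_df is not shown in the original. Below is a minimal sketch of how it could be assembled from the daily sentiment counts and a Yahoo Finance CSV export; the file name, the use of tidyr, and the intermediate column names are assumptions.
require(dplyr)
require(tidyr)
# sketch only: spread the per-day sentiment counts into one row per date
daily_counts <- byTweet %>%
    ungroup() %>%
    rename(Date = `tweets_daimler.nodups.df$Date`) %>%
    pivot_wider(names_from = tweet, values_from = number, values_fill = 0) %>%
    mutate(sum.count = positive + negative + neutral,
           pos.count = positive,
           neg.count = negative,
           percentage.negative = round(negative/sum.count * 100),
           percentage.negatives = percentage.negative)  # both spellings appear below
# daily Daimler quotes exported from Yahoo Finance (file name is an assumption)
stock <- read.csv("DAI.DE.csv")
stock$Date <- as.Date(stock$Date)
tweets_plus_stock_df <- merge(daily_counts, stock[, c("Date", "Adj.Close")], by = "Date")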
library(stats)
cor(tweets_plus_stock_df$percentage.negative, tweets_plus_stock_df$Adj.Close,
use = "complete")
[1] -0.4727354
Plot of the percentage of negative tweets vs. the Daimler share price (adjusted close), with a linear regression line overlaid.
require(lattice)
xyplot(tweets_plus_stock_df$Adj.Close ~ tweets_plus_stock_df$percentage.negatives,
grid = TRUE, type = c("p", "r"), col.line = "red", lwd = 3, ylab = "Daily Change of Daimler Share Price in $",
xlab = "% of Negative Tweets", main = "% of negative Tweets vs Daily
Daimler Share Price")
We add a new feature by calculating the percentage of positive tweets per day
tweets_plus_stock_df$percentage.positives <- round((tweets_plus_stock_df$pos.count/tweets_plus_stock_df$sum.count) *
100)
Simple correlation
library(stats)
cor(tweets_plus_stock_df$percentage.positives, tweets_plus_stock_df$Adj.Close,
use = "complete")
[1] 0.1047692
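The call that produced the summary below is not shown in the original; presumably something along these lines (the use of summary() around glm() is an assumption):
summary(glm(tweets_plus_stock_df$Adj.Close ~ tweets_plus_stock_df$percentage.positives))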
Call:
glm(formula = tweets_plus_stock_df$Adj.Close ~ tweets_plus_stock_df$percentage.positives)
Deviance Residuals:
1 2 3 4 5 6 7
0.05639 0.95243 1.45623 0.32719 -1.38426 -0.37613 -1.03185
Coefficients:
Estimate Std. Error t value
(Intercept) 64.127683 1.089809 58.843
tweets_plus_stock_df$percentage.positives 0.004759 0.020203 0.236
Pr(>|t|)
(Intercept) 2.68e-08 ***
tweets_plus_stock_df$percentage.positives 0.823
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1.252059)
Null deviance: 6.3298 on 6 degrees of freedom
Residual deviance: 6.2603 on 5 degrees of freedom
AIC: 25.083
Number of Fisher Scoring iterations: 2
Plot of the percentage of positive tweets vs. the Daimler share price (adjusted close), with a linear regression line overlaid.
require(lattice)
xyplot(tweets_plus_stock_df$Adj.Close ~ tweets_plus_stock_df$percentage.positives,
grid = TRUE, type = c("p", "r"), col.line = "blue", lwd = 3, ylab = "Daily Change of Daimler Share Price in $",
xlab = "% of Positive Tweets", main = "% of positive Tweets vs Daily
Daimler Share Price")
The predictions obtained with the aid of sentiment analysis show considerable potential for share-market forecasting, at least on a short-term basis.
I would propose building a model that analyzes and combines the results of:
- Social media sentiment
- Economic news
- Market time-series analysis
Thank you for reading my analysis
KR
Niko
https://www.linkedin.com/in/niko-papacosmas-mba-pmp-mcse-695a2695/
Disclaimer: This article is intended for academic and educational purposes and is not an investment recommendation. The information provided should not be a substitute for advice from an investment professional. The models discussed in this article do not reflect actual investment performance. A decision to invest in any product or strategy should not be based on the information or conclusions contained herein. This is neither an offer to sell or buy, nor a solicitation of an offer to buy, interests in securities.