01.Abstract


The financial markets contain a plethora of statistical patterns. The behavior of these patterns resembles that of patterns in natural phenomena: both are affected by unknown and unstable variables, which leads to high unpredictability and volatility. That makes it almost impossible to forecast future behavior.

Burton Malkiel makes a similar argument in his 1973 book, “A Random Walk Down Wall Street.”

02.Introduction


Our forecasting methodology is to analyze and categorize the Twitter sentiment regarding the Daimler company (positive, neutral, negative) and then check whether there is a correlation between the sentiment and the share price.

The benefit of this methodology is that we can receive the tweets in real time and adjust our portfolio accordingly.

If there is a correlation between the daily price change and the sentiment score of our target share, we could predict the next day’s movement.
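A minimal sketch of this idea in R, assuming one row per trading day: we pair each day's sentiment score with the next day's price change and compute their correlation. The data frame daily and all of its values are hypothetical and only illustrate the mechanics.

```r
# Hypothetical daily data: the values are made up for illustration only
daily <- data.frame(
  day       = 1:5,
  sentiment = c(0.10, 0.35, -0.20, 0.05, 0.40),
  close     = c(64.2, 64.5, 65.1, 64.6, 64.7)
)

# Daily price change: close[t] - close[t - 1]; the first day has no change
daily$change <- c(NA, diff(daily$close))

# Shift the change up by one row so that each row pairs
# today's sentiment with tomorrow's price change
daily$next_change <- c(daily$change[-1], NA)

# Correlation between today's sentiment and tomorrow's change;
# the last row has no "tomorrow" and is dropped
lagged_cor <- cor(daily$sentiment, daily$next_change, use = "complete.obs")
lagged_cor
```

A clearly non-zero lagged correlation over a long enough period would be the signal we are looking for; with only a handful of days it is merely indicative.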

Of course, our models can be applied to any kind of opinion measurement, not only share forecasting. Marketing is a domain with many applications of this kind of analysis. Hundreds of millions of users send and read “tweets”.

03.Methodology


Positive or negative sentiment toward a target company can be identified by using Natural Language Processing with word dictionaries and an n-gram model.

Our Daimler-related tweets are extracted from the Twitter API. Sentiment analysis then classifies each tweet as positive, negative, or neutral.

Because tweets are a good example of unstructured data and contain a lot of symbols, hashtags, URLs, images, videos, etc., we apply several cleansing methods, such as corpus formation, stop-word removal, and conversion of emoticons and other non-ASCII encodings into ASCII.
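The cleansing steps can be sketched with base R alone. This is a simplified illustration rather than the tm pipeline used in this analysis; the function clean_tweet and the sample tweets are hypothetical.

```r
# Hypothetical raw tweets, made up for illustration
raw_tweets <- c(
  "RT @user: Daimler beats expectations! https://example.com/a #Daimler",
  "Not happy with $DDAIF today... http://example.com/b"
)

clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)   # drop URLs
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)        # drop the retweet marker
  x <- gsub("[@#]\\w+", " ", x, perl = TRUE)        # drop mentions and hashtags
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- gsub("[[:punct:]]", " ", x)                  # drop punctuation
  x <- gsub("[[:digit:]]", " ", x)                  # drop digits
  x <- tolower(x)                                   # lower-case everything
  x <- gsub("[[:space:]]+", " ", x)                 # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_tweet(raw_tweets)
cleaned
```

Stop-word removal is left out here to keep the sketch dependency-free.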

After data cleansing, tweets are labelled as “positive” or “negative” based on the sentiment score. The labelling is performed on the basis of an n-gram model.
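To make the term concrete: an n-gram is a sequence of n consecutive words. The sketch below only shows how n-grams can be extracted from a sentence in base R; the helper make_ngrams is hypothetical, and how the n-grams feed into the labelling is not covered here.

```r
# Split a sentence into overlapping word n-grams (default: bigrams)
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("Daimler is not good", n = 2)
bigrams
```

Bigrams such as "not good" keep the negation that single words would lose.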

We applied two types of classifiers:

-Naive Bayes Bernoulli
-Support Vector Machine
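To illustrate the first classifier, here is a from-scratch sketch of a Bernoulli Naive Bayes model on binary word-presence features, with Laplace smoothing. It is not the implementation used in this analysis; the functions train_bernoulli_nb and predict_bernoulli_nb and the toy data are hypothetical.

```r
# Hypothetical toy training set: rows are tweets, columns are binary
# word-presence features (1 = the word occurs in the tweet)
X <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 0, 1,
              0, 1, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("good", "strong", "recall")))
y <- c("positive", "positive", "negative", "negative")

train_bernoulli_nb <- function(X, y, alpha = 1) {
  classes <- sort(unique(y))
  # Log prior of each class
  log_prior <- log(table(factor(y, levels = classes)) / length(y))
  # P(word present | class), smoothed with alpha (Laplace smoothing)
  p <- t(sapply(classes, function(k) {
    rows <- X[y == k, , drop = FALSE]
    (colSums(rows) + alpha) / (nrow(rows) + 2 * alpha)
  }))
  list(classes = classes, log_prior = log_prior, p = p)
}

predict_bernoulli_nb <- function(model, x) {
  # Score = log P(class) + sum over words of log P(x_j | class),
  # where absent words contribute log(1 - p)
  scores <- sapply(seq_along(model$classes), function(i) {
    p <- model$p[i, ]
    model$log_prior[[i]] + sum(x * log(p) + (1 - x) * log(1 - p))
  })
  model$classes[which.max(scores)]
}

model <- train_bernoulli_nb(X, y)
predict_bernoulli_nb(model, c(good = 1, strong = 0, recall = 0))
```

Unlike the multinomial variant, the Bernoulli model also uses the absence of a word as evidence, which suits short texts such as tweets.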

Load required libraries

From the twitteR package we use the setup_twitter_oauth function. You first need to register an app on the Twitter developer site and then create the keys. I have hidden my keys.

require(ROAuth)
require(twitteR)

setup_twitter_oauth

mykey = "XXXXXXXXXX"
mysecret = "XXXXXXXXXX"
mytoken = "XXXXXXXXXX"
mysecrettoken = "XXXXXXXXXX"

# the arguments must be passed in this order
setup_twitter_oauth(mykey, mysecret, mytoken, mysecrettoken)

# From the twitteR package we use the searchTwitter function to extract
# tweets. The free Twitter API only returns recent tweets (roughly the last
# 7 to 10 days). You can also change resultType to "popular". With AND, OR
# you can add more search keywords

tweets_daimler = searchTwitter("Daimler", n = 10000, resultType = "mixed", lang = "en", 
    retryOnRateLimit = 100)

# We count the tweets and convert tweets_daimler from a list to a data frame
(n.tweet <- length(tweets_daimler))
tweets_daimler.df <- twListToDF(tweets_daimler)
head(tweets_daimler.df)

require(dplyr)
# We drop duplicate tweets, keeping the first occurrence of each text
tweets_daimler.nodups.df <- distinct(tweets_daimler.df, text, .keep_all = TRUE)
# We strip the ellipsis character ("… ") from the text
tweets_daimler.nodups.df$text <- gsub("… ", "", tweets_daimler.nodups.df$text)
# We rename the feature created to Date
tweets_daimler.nodups.df <- plyr::rename(tweets_daimler.nodups.df, c(created = "Date"))
# We transform it from datetime to date format
tweets_daimler.nodups.df$Date <- as.Date(tweets_daimler.nodups.df$Date)
# We extract the text of every tweet
tweets_text <- lapply(tweets_daimler, function(x) x$getText())
# We delete the duplicate texts (e.g. retweets)
tweets_unique <- unique(tweets_text)

# We convert non-ASCII characters such as emoticons to ASCII byte codes
tweets_text <- sapply(tweets_text, function(row) iconv(row, "latin1", "ASCII", 
    sub = "byte"))

require(stringr)
functionalText = str_replace_all(tweets_text, "[^[:graph:]]", " ")
# We create the tweets collection (corpus) with the daimler tweets to use in
# the sentiment analysis
require(tm)

tweets_collections <- Corpus(VectorSource(tweets_unique))
# We separate the words of each sentence to use them in the word cloud plot.
# Ignore the warnings raised by the following scripts

functionalText = str_replace_all(tweets_collections, "[^[:graph:]]", " ")

# We drop the characters that cannot be represented in ASCII
tweets_collections <- tm_map(tweets_collections, content_transformer(function(x) iconv(x, 
    from = "latin1", to = "ASCII", sub = "")))

# We change all words from capital to lower case
tweets_collections <- tm_map(tweets_collections, content_transformer(tolower))

# We delete all punctuation
tweets_collections <- tm_map(tweets_collections, removePunctuation)

# From the tm package we use stopwords() to get the English stop-word list
# (other languages are supported too) and remove those words, then we remove
# numbers and extra whitespace
tweets_collections <- tm_map(tweets_collections, removeWords, stopwords("english"))
tweets_collections <- tm_map(tweets_collections, removeNumbers)
tweets_collections <- tm_map(tweets_collections, stripWhitespace)

# From tm package we use the below functions to construct or coerce to a
# term-document matrix or a document-term matrix
term_matrix <- TermDocumentMatrix(tweets_collections, control = list(wordLengths = c(1, 
    Inf)))

# view the terms
term_matrix

# From the tm package we find the frequent terms in a document-term or
# term-document matrix.
(frequency.terms <- findFreqTerms(term_matrix, lowfreq = 20))
# Transform it to matrix
term_matrix <- as.matrix(term_matrix)

# We compute the frequency of the distinct words from the tweets and we sort
# it by frequency
distinctWordsfrequency <- sort(rowSums(term_matrix), decreasing = T)

# Transform it to df
distinctWordsDF <- data.frame(names(distinctWordsfrequency), distinctWordsfrequency)
colnames(distinctWordsDF) <- c("word", "frequency")

Display the 5 most frequent words
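As a self-contained illustration of this step, the sketch below counts word frequencies with base R and shows the top entries. The vector cleaned_tweets is hypothetical; in the analysis itself the same head() call would be applied to distinctWordsDF.

```r
# Hypothetical cleaned tweets, made up for illustration
cleaned_tweets <- c("daimler shares rise",
                    "daimler recall hits shares",
                    "daimler profit")

# Split into words and count each distinct word, most frequent first
words <- unlist(strsplit(cleaned_tweets, " ", fixed = TRUE))
word_counts <- sort(table(words), decreasing = TRUE)

top_words <- data.frame(word = names(word_counts),
                        frequency = as.integer(word_counts),
                        stringsAsFactors = FALSE)

# Display the 5 most frequent words
head(top_words, 5)
```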

We plot an interactive, colorful word cloud with the most frequent words

I used the well-known lexicon of positive and negative words created by Liu, Bing, Minqing Hu, and Junsheng Cheng. “Opinion observer: analyzing and comparing opinions on the web.” In Proceedings of the 14th international conference on World Wide Web (WWW-2005), pp. 342-351. ACM, May 10-14, 2005. Thanks to Liu and Hu, we can add more than 6,500 opinion words and phrases.

We will add some extra words that were observed while reviewing the tweets. We then merge the two positive lexicons.

04.Score Sentiment Function


We create the score sentiment function in order to run it afterwards on our Daimler tweets. The function transforms the vector of sentences into a simple array of scores with the plyr package. It first cleans the sentences with the gsub() function and then converts the letters from upper case to lower case. The function was inspired by https://medium.com/@rohitnair_94843/analysis-of-twitter-data-using-r-part-3-sentiment-analysis-53d0e5359cb8
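A minimal sketch of such a scoring function, assuming the score of a sentence is the number of positive lexicon matches minus the number of negative matches. It uses base R's vapply() instead of plyr to stay self-contained; the name score_sentiment and the tiny lexicons are hypothetical stand-ins for the full word lists.

```r
# Hypothetical mini-lexicons; the real analysis would use the full
# positive and negative word lists plus any extra words
pos_words <- c("good", "great", "strong", "gain")
neg_words <- c("bad", "poor", "weak", "recall")

score_sentiment <- function(sentences, pos_words, neg_words) {
  vapply(sentences, function(sentence) {
    # Clean: drop punctuation, control characters and digits
    sentence <- gsub("[[:punct:]]", "", sentence)
    sentence <- gsub("[[:cntrl:]]", "", sentence)
    sentence <- gsub("[[:digit:]]", "", sentence)
    # Convert to lower case and split into words
    words <- unlist(strsplit(tolower(sentence), "[[:space:]]+"))
    # Score = positive matches minus negative matches
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, numeric(1), USE.NAMES = FALSE)
}

scores <- score_sentiment(
  c("Great quarter, strong gain for Daimler!",
    "Bad news: another recall.",
    "Daimler holds its annual meeting."),
  pos_words, neg_words
)
scores
```

A score above zero marks a tweet as positive, below zero as negative, and exactly zero as neutral.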

We apply the above sentiment function on our tweets

05.Sentiment Classification using Distant Supervision


We apply “Sentiment Classification using Distant Supervision” to create the sentiments. This method was first applied by Go, Bhayani, and Huang in their Stanford University research paper “Twitter Sentiment Classification using Distant Supervision” (2009) (https://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf). They presented the results of machine learning algorithms for classifying the sentiment of Twitter messages using distant supervision. Their training data consists of Twitter messages with emoticons, which are used as noisy labels. This type of training data is abundantly available and can be obtained through automated means. They show that machine learning algorithms (Naive Bayes, Maximum Entropy, and SVM) achieve accuracy above 80% when trained with emoticon data.
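The labelling idea can be sketched in a few lines of base R: an emoticon acts as a noisy label and is then stripped from the text, so that a classifier has to learn from the remaining words. The function label_by_emoticon, the short emoticon list, and the sample tweets are hypothetical simplifications.

```r
# Hypothetical tweets, made up for illustration
tweets <- c("Love the new Daimler truck :)",
            "Stuck waiting for service again :(",
            "Daimler publishes quarterly report")

label_by_emoticon <- function(text) {
  has_pos <- grepl(":)", text, fixed = TRUE) | grepl(":-)", text, fixed = TRUE)
  has_neg <- grepl(":(", text, fixed = TRUE) | grepl(":-(", text, fixed = TRUE)
  # Keep a label only when exactly one polarity is present
  ifelse(has_pos & !has_neg, "positive",
         ifelse(has_neg & !has_pos, "negative", NA_character_))
}

labels <- label_by_emoticon(tweets)

# Strip the emoticons from the text and drop the unlabelled tweets
training <- data.frame(text = trimws(gsub(":-?[()]", "", tweets)),
                       label = labels,
                       stringsAsFactors = FALSE)
training <- training[!is.na(training$label), ]
training
```

Tweets without an emoticon, or with both kinds, get no label and are left out of the training set.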

We create an interactive plot of the mean sentiment score per date
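The aggregation behind this plot can be sketched with aggregate() from base R. The data frame scored and its values are hypothetical; the interactive plotting itself is not shown.

```r
# Hypothetical scored tweets: one row per tweet
scored <- data.frame(
  Date  = as.Date(c("2018-05-09", "2018-05-09",
                    "2018-05-10", "2018-05-10", "2018-05-10")),
  score = c(1, -1, 2, 0, 1)
)

# Mean sentiment score per date: one row per day
daily_mean <- aggregate(score ~ Date, data = scored, FUN = mean)
daily_mean
```

The resulting data frame has one point per day and can be passed to any plotting function.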

We notice that the mean increased sharply on the 10th of May

Now that we have calculated the sentiments, we can run the

06.Distant Supervised Learning Classifier


Display the first 6 scores

  score
1     2
2     0
3     0
4    -1
5     0
6     0

Now we create a new df that combines all the above results with the original Daimler tweets df

We plot an interactive histogram of sentiment for all tweets

Plot interactive scatter plot with tweet date vs sentiment score

We create the percentages of the 7 different sentiments of our tweets

Perc = daimler.scores$score

# Sentiment Good: the output is a logical vector (TRUE or FALSE per tweet)
good <- sapply(Perc, function(Perc) Perc <= 3 && Perc >= 1)

# We convert it to actual value
Perc[good]
list_good = Perc[good]
value_good = length(list_good)

# Sentiment Very good
verygood <- sapply(Perc, function(Perc) Perc > 3 && Perc < 6)
# We convert it to actual value
Perc[verygood]
list_verygood = Perc[verygood]
value_verygood = length(list_verygood)

# Sentiment Outstanding
outstanding <- sapply(Perc, function(Perc) Perc >= 6)
# We convert it to actual value
Perc[outstanding]
list_outstanding = Perc[outstanding]
value_outstanding = length(list_outstanding)

# Sentiment Bad (unsatisfactory): the output is a logical vector
bad <- sapply(Perc, function(Perc) Perc >= -3 && Perc <= -1)
# We convert it to actual value
Perc[bad]
list_bad = Perc[bad]
value_bad = length(list_bad)

# Sentiment Very bad (poor): the output is a logical vector
verybad <- sapply(Perc, function(Perc) Perc < -3 && Perc > -6)
# We convert it to actual value
Perc[verybad]
list_verybad = Perc[verybad]
value_verybad = length(list_verybad)

# Sentiment extremely bad
extremelybad <- sapply(Perc, function(Perc) Perc <= -6)
# We convert it to actual value
Perc[extremelybad]
list_extremelybad = Perc[extremelybad]
value_extremelybad = length(list_extremelybad)

# Sentiment Neutral
neutral <- sapply(Perc, function(Perc) Perc > -1 && Perc < 1)
list_neutral = Perc[neutral]
value_neutral = length(list_neutral)

slices <- c(value_good, value_extremelybad, value_bad, value_verygood, value_verybad, 
    value_neutral, value_outstanding)
lbls <- c("Good", "Extremely Bad", "Bad", "Great", "Poor", "Neutral", "Outstanding")

# Check for 0's
slices
# We see that we have 0 extremely bad tweets so we remove it
slices <- c(value_good, value_bad, value_verygood, value_verybad, value_neutral, 
    value_outstanding)
lbls <- c("Good", "Bad", "Great", "Poor", "Neutral", "Outstanding")
pct <- round(slices/sum(slices) * 100)  # We compute the percentages to use in the pie
lbls <- paste(lbls, pct)  # add percentage to the labels 
lbls <- paste(lbls, "%", sep = "")  # we add the symbol % to labels 
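The same seven categories can also be produced in one pass with cut(). This is an alternative sketch, not the code used above, and it assumes integer scores so that the break points reproduce the same ranges; the scores vector is hypothetical.

```r
# Hypothetical integer sentiment scores
scores <- c(2, 0, 0, -1, 0, 0, 4, -4, 6, 1)

# Intervals are right-closed: (-Inf,-6], (-6,-4], (-4,-1], (-1,0], (0,3], (3,5], (5,Inf)
category <- cut(scores,
                breaks = c(-Inf, -6, -4, -1, 0, 3, 5, Inf),
                labels = c("Extremely Bad", "Poor", "Bad", "Neutral",
                           "Good", "Great", "Outstanding"))

counts <- table(category)
counts <- counts[counts > 0]            # drop empty categories
pct <- round(100 * prop.table(counts))  # percentage per category
pie_labels <- paste0(names(pct), " ", pct, "%")
pie_labels
```

Empty categories are dropped automatically, so no manual check for zeros is needed.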

We create a 3D pie chart with the percentages of positive, negative, and neutral tweets

We create 3 main sentiment categories : positive - negative - neutral

We plot the sentiment categories (positive, negative and neutral) per day
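A small sketch of how the three categories and their daily counts can be derived from the scores, using sign() and table(). The scores are the six displayed earlier; the dates attached to them are hypothetical.

```r
# The six scores shown earlier, with hypothetical dates
tweets <- data.frame(
  Date  = as.Date("2018-05-09") + c(0, 0, 0, 1, 1, 1),
  score = c(2, 0, 0, -1, 0, 0)
)

# sign() maps each score to -1, 0 or 1, which we label as a factor
tweets$sentiment <- factor(sign(tweets$score),
                           levels = c(-1, 0, 1),
                           labels = c("negative", "neutral", "positive"))

# Number of tweets per day and sentiment category
per_day <- table(tweets$Date, tweets$sentiment)
per_day
```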

07.Daimler share price calculation


Now comes the part where I calculate the stock price of Daimler. I downloaded the CSV from Yahoo Finance for the same 10 days on which I collected the tweets. We import the Daimler stock prices from my GitHub

daimler_stock_prices <- read.csv("https://raw.githubusercontent.com/papacosmas/daimler_sentiment_stock_analysis/master/DDAIF.csv", 
    header = TRUE)

# We parse the date feature so that R knows it is a date

daimler_stock_prices$Date <- as.Date(daimler_stock_prices$Date, format = "%Y-%m-%d")

# Now we create a new df with the merged sentiment analysis df plus the
# stock prices. Be careful to have the same dates in both
require(dplyr)
tweets_plus_stock <- left_join(daimler.score.merge, daimler_stock_prices, by = "Date")

# We create a new df without the rows that have no adjusted closing entry,
# such as weekends (Adj.Close is the adjusted closing price of the share on
# the corresponding date)
weekday_tweets_plus_stock <- subset(tweets_plus_stock, !is.na(Adj.Close))

# We add new features fields (as 0,1 factor) to mark tweets as positive or
# negative or neutral
weekday_tweets_plus_stock$positive <- as.numeric(weekday_tweets_plus_stock$score > 
    0)

weekday_tweets_plus_stock$negative <- as.numeric(weekday_tweets_plus_stock$score < 
    0)

weekday_tweets_plus_stock$neutral <- as.numeric(weekday_tweets_plus_stock$score == 
    0)

# We create a new df with sums, going from one row per tweet to one row per
# day and showing the number of positive, negative and neutral tweets per day
require(plyr)
tweets_plus_stock_df <- ddply(weekday_tweets_plus_stock, c("Date", "High", "Low", 
    "Adj.Close"), plyr::summarise, pos.count = sum(positive), neg.count = sum(negative), 
    neu.count = sum(neutral))

# We add a new feature with the sum tweets of the 3 sentiment categories
tweets_plus_stock_df$sum.count <- tweets_plus_stock_df$pos.count + tweets_plus_stock_df$neg.count + 
    tweets_plus_stock_df$neu.count

# We calculate the percentage of negative tweets for each day, and add it as
# a new feature
tweets_plus_stock_df$percentage.negatives <- round((tweets_plus_stock_df$neg.count/tweets_plus_stock_df$sum.count) * 
    100)

Simple correlation

08.Correlation analysis


From the scatterplot we can identify a negative relationship between the percentage of negative tweets and the stock price: as the percentage of negative tweets increases, the value of the stock decreases.

The Pearson correlation coefficient shows a medium negative correlation (-0.473) between the two, computed over the 7 trading days in the sample.
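For reference, the coefficient can be computed from its definition (covariance divided by the product of the standard deviations) and checked against cor(). The two vectors below are hypothetical and do not reproduce the value above; cor.test() additionally reports a p-value and confidence interval, which is useful with so few observations.

```r
# Hypothetical daily values for 7 trading days
pct_negative <- c(12, 18, 25, 15, 30, 22, 28)
adj_close    <- c(65.1, 64.8, 63.9, 64.9, 63.5, 64.6, 64.0)

# Pearson's r from its definition
dx <- pct_negative - mean(pct_negative)
dy <- adj_close - mean(adj_close)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function gives the same value
r_builtin <- cor(pct_negative, adj_close)

# Significance test of H0: correlation = 0, with n - 2 degrees of freedom
r_test <- cor.test(pct_negative, adj_close)
r_test$p.value
```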

In the kernel density plot we can observe a strong negative skew. Density plots for every continuous numeric variable help us identify skewness, kurtosis, and other distribution information. Q-Q normal plots can also be useful for diagnosing departures from normality by comparing the data quantiles to those of a standard normal distribution; substantial deviations from linearity indicate departures from normality. Quantiles are cut points that divide an ordered data set into equal-sized groups.

[1] -0.4727354

09.Fitting Generalized Linear Models from stats package



Call:
glm(formula = tweets_plus_stock_df$Adj.Close ~ tweets_plus_stock_df$percentage.positives)

Deviance Residuals: 
       1         2         3         4         5         6         7  
 0.05639   0.95243   1.45623   0.32719  -1.38426  -0.37613  -1.03185  

Coefficients:
                                           Estimate Std. Error t value
(Intercept)                               64.127683   1.089809  58.843
tweets_plus_stock_df$percentage.positives  0.004759   0.020203   0.236
                                          Pr(>|t|)    
(Intercept)                               2.68e-08 ***
tweets_plus_stock_df$percentage.positives    0.823    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.252059)

    Null deviance: 6.3298  on 6  degrees of freedom
Residual deviance: 6.2603  on 5  degrees of freedom
AIC: 25.083

Number of Fisher Scoring iterations: 2

Plot of % negative tweets vs daily change in stock price with linear regression line overlaid

Fit a loess line

Add new feature by calculating the percentage of positive tweets per day

Simple correlation

[1] 0.1047692

Call:
glm(formula = tweets_plus_stock_df$Adj.Close ~ tweets_plus_stock_df$percentage.positives)

Deviance Residuals: 
       1         2         3         4         5         6         7  
 0.05639   0.95243   1.45623   0.32719  -1.38426  -0.37613  -1.03185  

Coefficients:
                                           Estimate Std. Error t value
(Intercept)                               64.127683   1.089809  58.843
tweets_plus_stock_df$percentage.positives  0.004759   0.020203   0.236
                                          Pr(>|t|)    
(Intercept)                               2.68e-08 ***
tweets_plus_stock_df$percentage.positives    0.823    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.252059)

    Null deviance: 6.3298  on 6  degrees of freedom
Residual deviance: 6.2603  on 5  degrees of freedom
AIC: 25.083

Number of Fisher Scoring iterations: 2
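A short sketch of how a model like this can be fitted and read, using the data argument of glm() so that the coefficient names stay short. The data frame daily and its values are hypothetical. With the default gaussian family, glm() gives the same coefficients as lm(), and 1 - residual deviance / null deviance equals the squared Pearson correlation.

```r
# Hypothetical daily data for 7 trading days
daily <- data.frame(
  Adj.Close            = c(64.2, 65.1, 65.6, 64.5, 62.8, 63.8, 63.1),
  percentage.positives = c(35, 52, 60, 41, 55, 38, 47)
)

# Gaussian GLM (the default family), equivalent to ordinary least squares
fit    <- glm(Adj.Close ~ percentage.positives, data = daily)
fit_lm <- lm(Adj.Close ~ percentage.positives, data = daily)

# Share of the deviance explained by the model
r_squared <- 1 - fit$deviance / fit$null.deviance
r_squared
```

Applied to the output above, 1 - 6.2603 / 6.3298 is about 0.011, and its square root is about 0.105, which agrees with the simple correlation of 0.1047692.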

Plot of % positive tweets vs daily change in stock price with linear regression line overlaid

10.Conclusion


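The results obtained with the aid of sentiment analysis showed promising potential for share market forecasting, at least on a short-term basis: the percentage of negative tweets had a medium negative correlation with the share price (-0.473), while the percentage of positive tweets had only a weak positive one (0.105).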

I would propose to create a model that would analyze and combine the results of:

-Social media sentiment
-Economic News
-Market Time Series Analysis

Thank you for reading my analysis
KR
Niko

11.Contact


https://www.linkedin.com/in/niko-papacosmas-mba-pmp-mcse-695a2695/

Disclaimer: This article is intended for academic and educational purposes and is not an investment recommendation. The information that we provide should not be a substitute for advice from an investment professional. The models discussed in this paper do not reflect actual investment performance. A decision to invest in any product or strategy should not be based on the information or conclusions contained herein. This is neither an offer to sell or buy nor a solicitation of an offer to buy interests in securities.