Load R Packages

# Load required packages
library(tm)
library(lda)
library(httr)
library(dplyr)
library(tidyr)
library(anytime)
library(stringi)
library(twitteR)
library(syuzhet)
library(tidytext)
library(tidyverse)
library(SnowballC)
library(topicmodels)
library(wordcloud)
library(BiocManager)

# Package required for running Twitter API authorization.
install.packages('base64enc')

Introduction

Donald Trump changed the communication platform of politics from bureaucratic approaches of scheduled and managed political speeches to direct communication via Twitter. He started using Twitter heavily during his 2016 presidential campaign and has not looked back since. His tweets have been analyzed by a variety of researchers, from the frequency of “angry” tweets, his emotional state at the times of his tweets, and the type of tweets he sends from specific mobile devices, to the impact of his tweets on financial markets. In this study, the main business question we are trying to solve is “Can we leverage President Trump’s trade or interest rate related tweets and predict the market?” We review the tweets from January 2018 to the present, classify his tweets based on their topics and context related to trade wars, interest rates, employment in the US, and consumer spending, create a model, and perform sentiment analysis.

Overall, the main goal of this study is to examine the classified tweets of Donald Trump, discover possible relationships with the stock market, and see how the context of the text used on his account impacts the stock market. In order to do this, we will identify and describe common topics and uses of text that can move the market in the corpus of tweets sent from the @realDonaldTrump Twitter account. We can further compare stock market data against these tweets to see if there is any correlation and whether we can create a topic model and sentiment analysis that can predict the stock market.

Data Collection

Based on the business problem in question, the required data consists of tweets and stock market data. They are available via Twitter and financial news platforms.

Donald Trump’s Tweets

Twitter’s developer account provides many API products, including tools to extract tweets and their metadata. We will use this API to extract the Twitter data in a structured format for further wrangling and analysis. In order to use the Twitter API, we created a Twitter account and requested developer API access. Once we received approval, we were provided with API key and token access information. We will be using these keys and tokens to access the API, and the “twitteR” package to extract Donald Trump’s tweets.

Connect to Twitter through API

# Authorization keys.
app_name <- "JAS"
consumer_key <- 'sPwbbZCtf8nfSMxhYTzqI8WHJ'
consumer_secret <- 'KfcOxgElcQ70fi3QNy8LkuDAN18dunXT147MoA8aBOLzpr3Vd3'
access_token <- '600477513-rdd3Fcywq1sfnh5S60egRQxXh0TlDqfrLzyZo4Vk'
access_secret <- 'SdDFCJUOoqAwt671VXeLaD781TdUYdeBSW2gyQMG4P5Zh'

# Extract some tweets from Twitter.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
tweets <- userTimeline("realDonaldTrump", n=5)
tweets
## [[1]]
## [1] "realDonaldTrump: Without the horror show that is the Radical Left, Do Nothing Democrats, the Stock Markets and Economy would be even… https://t.co/emGpx5ezOD"

Upon extracting Twitter data via the Twitter API and converting it to a data frame, we notice that there is a limitation on the number of tweets (3,200) we can extract using the Twitter API. This is due to our account being a “Free Developer Account”, and in order for us to increase the tweet count, we are required to upgrade our account. Since this might become problematic and put a damper on our analysis and future model, we think it will be better to use the Trump Twitter Archive (http://www.trumptwitterarchive.com/archive), a service that archives all of Donald Trump’s tweets.
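
For reference, below is a minimal sketch of what the capped extraction would look like with the twitteR package, assuming the OAuth setup from the earlier chunk has already been run (twListToDF is the twitteR helper that flattens a list of status objects into a data frame).

# Pull up to the free-tier cap of roughly 3,200 of the user's most recent tweets,
# excluding retweets, and convert the list of status objects into a data frame.
tweets_api <- userTimeline("realDonaldTrump", n = 3200, includeRts = FALSE)
tweets_api.df <- twListToDF(tweets_api)
dim(tweets_api.df)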

Load Data

# President Trump tweets from 01/01/2018 to 11/21/2019.
tweets_raw <- read.csv("https://raw.githubusercontent.com/SieSiongWong/Twitter/dev/trumptweets.csv")

# S&P stock price data from 01/04/2016 to 11/22/2019.
stocks_raw <- read.csv("https://raw.githubusercontent.com/SieSiongWong/Twitter/dev/sandp.csv")

head(tweets_raw)
##               source
## 1 Twitter for iPhone
## 2 Twitter for iPhone
## 3 Twitter for iPhone
## 4 Twitter for iPhone
## 5 Twitter for iPhone
## 6 Twitter for iPhone
##                                                                                                                                           text
## 1                                                                                                                      https://t.co/osezECwPO1
## 2                                                                   Poll: Trump leads top 2020 Democrats in Wisconsin https://t.co/P7z7Si8I0h.
## 3 RT @realDonaldTrump: Impeachment Witch Hunt is now OVER! Ambassador Sondland asks U.S. President (me): “What do you want from Ukraine? I ke…
## 4  RT @realDonaldTrump: ....”I WANT NOTHING! I WANT NOTHING! I WANT NO QUID PRO QUO! TELL PRESIDENT ZELENSKY TO DO THE RIGHT THING!” Later Am…
## 5                     “All four of Gordon Sondland’s lawyers are Democrat Donors.” @TuckerCarlson  Despite this big win today for Republicans!
## 6                                                                         Watch @TuckerCarlson @seanhannity @IngrahamAngle. Big News! @FoxNews
##        created_at retweet_count favorite_count is_retweet       id_str
## 1 11/21/2019 2:47         24221          62863      false 1.197346e+18
## 2 11/21/2019 1:22         14184          52661      false 1.197324e+18
## 3 11/21/2019 1:16         23988              0       true 1.197323e+18
## 4 11/21/2019 1:16         18754              0       true 1.197323e+18
## 5 11/21/2019 1:11         16331          60155      false 1.197322e+18
## 6 11/21/2019 1:03          9837          37564      false 1.197320e+18
head(stocks_raw)
##         Date    Open    High     Low   Close Adj.Close     Volume
## 1 2016-01-04 2038.20 2038.20 1989.68 2012.66   2012.66 4304880000
## 2 2016-01-05 2013.78 2021.94 2004.17 2016.71   2016.71 3706620000
## 3 2016-01-06 2011.71 2011.71 1979.05 1990.26   1990.26 4336660000
## 4 2016-01-07 1985.32 1985.32 1938.83 1943.09   1943.09 5076590000
## 5 2016-01-08 1945.97 1960.40 1918.46 1922.03   1922.03 4664940000
## 6 2016-01-11 1926.12 1935.65 1901.10 1923.67   1923.67 4607290000

Description of the variables in our Twitter data set is as follows:

  • text: Content of the tweet.

  • created_at: Date and time the tweet was created.

  • retweet_count: The number of times the tweet was retweeted.

  • favorite_count: The number of times the tweet was favorited.

Description of the variables in our stock market data set is as follows:

  • Date: The date of the stock market ticker value.

  • Open: The value of the ticker at market open.

  • High: The highest value of the ticker during the trading date.

  • Low: The lowest value of the ticker during the trading date.

  • Close: The closing value of the ticker during the trading date.

  • Adj. Close: The adjusted closing value of the ticker during the trading date.

  • Volume: The trading volume of the ticker during the trading date.

Data Cleaning and Preparation

In this phase of the study, we will construct and clean both the stock market and tweets data sets. The cleaning phase will include updating the date class, filtering the data set based on our analysis goal, transforming values such as the percentage change in stock value, removing unwanted characters from text, and selecting only the columns we need. We will further tokenize the text within the tweets data set to see the word frequency and create a document term matrix as part of pre-processing.

Stock Data Cleaning

# Update Date column into date format.
stocks_raw$Date <- as.Date(stocks_raw$Date)

# Select data from 01/01/2018 to 11/20/2019 and calculate price change percentage between close and open price.
stocks.df <- stocks_raw %>% 
  filter(between(Date, as.Date("2018-01-01"),as.Date("2019-11-20"))) %>%
  mutate(Pct_Change=(Close-Open)/Open*100)

head(stocks.df)
##         Date    Open    High     Low   Close Adj.Close     Volume
## 1 2018-01-02 2683.73 2695.89 2682.36 2695.81   2695.81 3367250000
## 2 2018-01-03 2697.85 2714.37 2697.77 2713.06   2713.06 3538660000
## 3 2018-01-04 2719.31 2729.29 2719.07 2723.99   2723.99 3695260000
## 4 2018-01-05 2731.33 2743.45 2727.92 2743.15   2743.15 3236620000
## 5 2018-01-08 2742.67 2748.51 2737.60 2747.71   2747.71 3242650000
## 6 2018-01-09 2751.15 2759.14 2747.86 2751.29   2751.29 3453480000
##    Pct_Change
## 1 0.450122743
## 2 0.563780805
## 3 0.172099941
## 4 0.432749747
## 5 0.183763965
## 6 0.005093761

Tweets Data Cleaning and Preparation

# Extract columns from trumptweets.csv file that are useful for analysis.
tweets_slc <- tweets_raw %>% select(source, text, created_at) 

# Remove source other than iphone.
tweets_slc <- tweets_slc %>% filter(source=="Twitter for iPhone")

# Drop source column.
tweets_slc <- tweets_slc %>% select(text, created_at)

# Separate column "created_at" into "date" and "hour".
tweets_slc <- separate(data = tweets_slc, col = created_at, into  = c('date', 'hour'), sep = ' ') %>% select(text, date, hour)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows
## [8596].
# Remove minutes in hour column.
tweets_slc$hour <- gsub("\\:+\\w*","", tweets_slc$hour)

# Remove meaningless characters and symbols.
tweets_slc$text <- gsub("&amp","", tweets_slc$text)
tweets_slc$text <- gsub("(RT)((?:\\b\\w*@\\w+)+)","", tweets_slc$text)
tweets_slc$text <- gsub("^RT","", tweets_slc$text)
tweets_slc$text <- gsub("@\\w+","", tweets_slc$text)
tweets_slc$text <- gsub("[[:punct:]]","", tweets_slc$text)
tweets_slc$text <- gsub("[[:digit:]]+\\s","", tweets_slc$text)
tweets_slc$text <- gsub("http\\w+","", tweets_slc$text)
tweets_slc$text <- gsub("[ \t]{2,}"," ", tweets_slc$text)

# Remove all non-ASCII characters 
tweets_slc$text <- iconv(tweets_slc$text, "UTF-8", "ASCII", sub="")

# Remove rows with empty or blank text.
tweets_slc <- tweets_slc %>% na_if("") %>% na_if(" ") %>% na.omit()

# Tweets containing fewer than 20 characters are treated as noise and removed.
tweets_slc <- tweets_slc %>% filter(nchar(text)>20)

# Add id column to consider each text row as a document.
tweets_slc$doc_id <- seq.int(nrow(tweets_slc))

head(tweets_slc)
##                                                                                                                                                                                           text
## 1                                                                                                                                                 Poll Trump leads top Democrats in Wisconsin 
## 2                                                                               Impeachment Witch Hunt is now OVER Ambassador Sondland asks US President me What do you want from Ukraine I ke
## 3                                                                                  I WANT NOTHING I WANT NOTHING I WANT NO QUID PRO QUO TELL PRESIDENT ZELENSKY TO DO THE RIGHT THING Later Am
## 4                                                                                          All four of Gordon Sondlands lawyers are Democrat Donors Despite this big win today for Republicans
## 5                                                                                                                                                      If this were a prizefight theyd stop it
## 6 Today I opened a major Apple Manufacturing plant in Texas that will bring high paying jobs back to America Today Nancy Pelosi closed Congress because she doesnt care about American Workers
##         date hour doc_id
## 1 11/21/2019    1      1
## 2 11/21/2019    1      2
## 3 11/21/2019    1      3
## 4 11/21/2019    1      4
## 5 11/20/2019   23      5
## 6 11/20/2019   23      6

Tokenizing Text and Word Frequency

# Tokenize the text and see frequency of words.
tweets_slc %>% 
  unnest_tokens(word, text)%>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE) 
## Joining, by = "word"
## # A tibble: 12,793 x 2
##    word          n
##    <chr>     <int>
##  1 president  1219
##  2 people     1010
##  3 democrats   898
##  4 trump       853
##  5 country     729
##  6 news        671
##  7 border      648
##  8 fake        589
##  9 time        478
## 10 media       431
## # ... with 12,783 more rows
# We can see that words such as "president" and "trump" are not pertaining to trade, so we remove them.
tweets_slc <- tweets_slc %>% mutate(text=tolower(text))
tweets_slc$text <- gsub("president?","", tweets_slc$text)
tweets_slc$text <- gsub("trump?","", tweets_slc$text)

# Retokenize the text and check to see if words being removed.
tweets_slc %>% 
  unnest_tokens(word, text)%>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 12,743 x 2
##    word          n
##    <chr>     <int>
##  1 people     1010
##  2 democrats   898
##  3 country     729
##  4 news        671
##  5 border      648
##  6 fake        589
##  7 time        479
##  8 media       431
##  9 america     421
## 10 united      414
## # ... with 12,733 more rows
# creating tweets frequency dataframe

top_words <- tweets_slc %>% 
  unnest_tokens(word, text)%>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
top_words <- filter(top_words, n>300)
head(top_words)
## # A tibble: 6 x 2
##   word          n
##   <chr>     <int>
## 1 people     1010
## 2 democrats   898
## 3 country     729
## 4 news        671
## 5 border      648
## 6 fake        589

Creating Document Term Matrix

# Select text and id column.
tweetscorpus.df <- tweets_slc %>% select(doc_id, text)

# Create a corpus for document term matrix.
tweetscorpus <- VCorpus(DataframeSource(tweetscorpus.df))

# Remove all punctuation from the corpus.
tweetscorpus <- tm_map(tweetscorpus, removePunctuation)

# Remove all English stopwords from the corpus.
tweetscorpus <- tm_map(tweetscorpus, removeWords, stopwords("en"))
tweetscorpus <- tm_map(tweetscorpus, removeWords, stopwords("SMART"))

# Remove all numbers from the corpus.
tweetscorpus <- tm_map(tweetscorpus, removeNumbers)

# Strip extra white spaces in the corpus.
tweetscorpus <- tm_map(tweetscorpus, stripWhitespace)

# Stem words in the corpus.
tweetscorpus <- tm_map(tweetscorpus, stemDocument)

# Build a document term matrix.
tweetsdtm <- DocumentTermMatrix(tweetscorpus)

# Remove sparse terms which don't appear very often. Limit the document term matrix to contain terms appearing in at least 2% of documents.
tweetsdtm <- removeSparseTerms(tweetsdtm, 0.98)

# Find the sum of words in each document and remove all docs without words.
rowTotals <- apply(tweetsdtm , 1, sum)
tweetsdtm.new   <- tweetsdtm[rowTotals> 0, ]

# Convert the document term matrix into a plain matrix format for the LDA model.
tweetsdtm.matrix <- as.matrix(tweetsdtm.new)

head(tweetsdtm.matrix, n=5)
##     Terms
## Docs administr america american back bad big billion border call campaign
##    1         0       0        0    0   0   0       0      0    0        0
##    2         0       0        0    0   0   0       0      0    0        0
##    3         0       0        0    0   0   0       0      0    0        0
##    4         0       0        0    0   0   1       0      0    0        0
##    5         0       0        0    0   0   0       0      0    0        0
##     Terms
## Docs china collus congratul congress continu corrupt countri crime day
##    1     0      0         0        0       0       0       0     0   0
##    2     0      0         0        0       0       0       0     0   0
##    3     0      0         0        0       0       0       0     0   0
##    4     0      0         0        0       0       0       0     0   0
##    5     0      0         0        0       0       0       0     0   0
##     Terms
## Docs deal dem democrat dollar dont economi elect end fact fake fbi good
##    1    0   0        1      0    0       0     0   0    0    0   0    0
##    2    0   0        0      0    0       0     0   0    0    0   0    0
##    3    0   0        0      0    0       0     0   0    0    0   0    0
##    4    0   0        1      0    0       0     0   0    0    0   0    0
##    5    0   0        0      0    0       0     0   0    0    0   0    0
##     Terms
## Docs great happen hard high hillari histori honor hous hunt illeg immigr
##    1     0      0    0    0       0       0     0    0    0     0      0
##    2     0      0    0    0       0       0     0    0    1     0      0
##    3     0      0    0    0       0       0     0    0    0     0      0
##    4     0      0    0    0       0       0     0    0    0     0      0
##    5     0      0    0    0       0       0     0    0    0     0      0
##     Terms
## Docs impeach import includ job law long love made make media meet militari
##    1       0      0      0   0   0    0    0    0    0     0    0        0
##    2       1      0      0   0   0    0    0    0    0     0    0        0
##    3       0      0      0   0   0    0    0    0    0     0    0        0
##    4       0      0      0   0   0    0    0    0    0     0    0        0
##    5       0      0      0   0   0    0    0    0    0     0    0        0
##     Terms
## Docs mueller nation news north number obama parti peopl rate record report
##    1       0      0    0     0      0     0     0     0    0      0      0
##    2       0      0    0     0      0     0     0     0    0      0      0
##    3       0      0    0     0      0     0     0     0    0      0      0
##    4       0      0    0     0      0     0     0     0    0      0      0
##    5       0      0    0     0      0     0     0     0    0      0      0
##     Terms
## Docs republican russia schiff secur senat show start state stop stori
##    1          0      0      0     0     0    0     0     0    0     0
##    2          0      0      0     0     0    0     0     0    0     0
##    3          0      0      0     0     0    0     0     0    0     0
##    4          1      0      0     0     0    0     0     0    0     0
##    5          0      0      0     0     0    0     0     0    1     0
##     Terms
## Docs strong support talk tax thing time today total trade unit usa vote
##    1      0       0    0   0     0    0     0     0     0    0   0    0
##    2      0       0    0   0     0    0     0     0     0    0   0    0
##    3      0       0    0   0     1    0     0     0     0    0   0    0
##    4      0       0    0   0     0    0     1     0     0    0   0    0
##    5      0       0    0   0     0    0     0     0     0    0   0    0
##     Terms
## Docs wall watch win witch work world year
##    1    0     0   0     0    0     0    0
##    2    0     0   0     1    0     0    0
##    3    0     0   0     0    0     0    0
##    4    0     0   1     0    0     0    0
##    5    0     0   0     0    0     0    0

Data Exploration

In order to define our analytical approach, we would like to understand the data collected, review initial insights about the data, and make sure we do not require additional data to answer the problem in question.

We can initially take a look at the top words within the tweets.

# visualization of top words within the complete tweets data

theme_set(theme_classic())


ggplot(top_words, aes(x=word, y=n))+
  geom_bar(stat="identity", width = 0.5, fill="tomato2")+
  xlab("Terms") + ylab("Count") + coord_flip() +
  theme(axis.text.x = element_text(angle=65, vjust=0.6, size=7))

wordcloud(tweetscorpus, max.words = 100, random.order = FALSE, rot.per = 0.15, min.freq = 5, colors = brewer.pal(8, "Dark2"))

There are some interesting findings here, such as the top two words used within the tweets being “people” and “democrats”. “Great” is another commonly used word. However, none of this top-words analysis is very helpful for reaching our business objective, as these words are not related to “trade”.
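
Since our business objective centers on trade-related language, a quick and admittedly rough check is to count a hand-picked set of trade-related terms directly in the tokenized tweets; the trade_terms vector below is purely illustrative.

# Frequency of a few illustrative trade-related terms in the cleaned tweets.
trade_terms <- c("trade", "china", "tariffs", "dollar", "economy", "jobs")

tweets_slc %>% 
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE) %>%
  filter(word %in% trade_terms)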

To be more specific, we can take a look at words individually and review the relationships between them.

# which words are associated with 'trade'?
findAssocs(tweetsdtm.new, "trade", 0.05)
## $trade
##    deal billion   china countri  dollar    year    unit    talk    good 
##    0.25    0.20    0.19    0.13    0.13    0.12    0.10    0.08    0.06 
##     usa    long    meet 
##    0.06    0.05    0.05
# which words are associated with 'china'?
findAssocs(tweetsdtm.new, "china", 0.05)
## $china
## billion    deal   trade  dollar continu     usa    good    meet    make 
##    0.19    0.19    0.19    0.16    0.13    0.11    0.08    0.08    0.07 
##    year    unit   start 
##    0.07    0.06    0.05
# which words are associated with 'job'?
findAssocs(tweetsdtm.new, "job", 0.05)
## $job
##    great militari  economi   record   number      tax 
##     0.13     0.09     0.07     0.07     0.06     0.06

We can see that “trade” has associations with multiple words such as “deal”, “billion”, and “china”, and the term “job” has associations with “great”, “militari”, and “economi”.

freq_terms <- findFreqTerms(tweetsdtm.new, lowfreq = 500)  

#visualizing the association

plot(tweetsdtm.new, term = freq_terms, corThreshold = 0.10, weighting = T)

We can also see associations between words: for example, “news” and “fake”, as well as “great”, “jobs”, and “state”, are commonly used together.

We should also look at how the stock market has been trending within our target date range.

ggplot(stocks.df, aes(x=Date))+
  geom_line(aes(y=Open))+
  labs(title = "Stock Market Trend")+
  theme(axis.text.x = element_text(angle=90, vjust=0.5),
        panel.grid.minor = element_blank())

We can see that starting from 2019-01, the stock market is trending upwards.

Model Development

Based on our business objective and the data we have prepared, we decided to proceed with topic modeling as our analytical approach for model development. The idea is to identify topics across the set of documents, select the relevant topic, and create a final stock market data frame for prediction. For topic modeling, we selected Latent Dirichlet Allocation (LDA).

LDA Model

LDA is an unsupervised learning method that views documents as bags of words. For each topic that is generated, it picks a set of words associated with that topic. The steps LDA performs are outlined below:

  • Assume there are k topics across all the documents.

  • Distribute these topics across a document by assigning each word a topic.

  • For each word in the document, assume its topic is wrong but that every other word is assigned the correct topic.

  • Reassign the word a topic based on which topics are in the document and how many times the word has been assigned to a particular topic across all of the documents.

  • Repeat this process a number of times for each document.

We will use LDA for 30 topics.
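
To make these steps concrete before fitting the full model, below is a minimal sketch on a made-up toy corpus with k = 2 topics; the documents and the choice of k are illustrative only, and the sketch reuses the tm and topicmodels functions loaded above.

# Toy corpus: two documents about trade and two about the border.
toy_docs <- c("china trade deal billion dollar",
              "trade talks with china on tariffs",
              "border wall security crime",
              "secure the border stop illegal crime")

# Build a small document term matrix and fit a 2-topic LDA model with the Gibbs method.
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))
toy_lda <- LDA(toy_dtm, k=2, method="Gibbs", control = list(seed = 123))

# Inspect the top terms per topic and the topic assigned to each document.
terms(toy_lda, 3)
topics(toy_lda)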

# Create a LDA model with Gibbs method for 30 topics.
tweetsLDA <- LDA(tweetsdtm.matrix, 30, method="Gibbs", control = list(seed = 123))

# Top 30 words per topic.
terms(tweetsLDA, 30)
##       Topic 1     Topic 2    Topic 3     Topic 4     Topic 5     
##  [1,] "work"      "democrat" "russia"    "big"       "today"     
##  [2,] "hard"      "dont"     "campaign"  "win"       "nation"    
##  [3,] "great"     "bad"      "fbi"       "congratul" "honor"     
##  [4,] "peopl"     "fact"     "hillari"   "great"     "great"     
##  [5,] "continu"   "happen"   "collus"    "includ"    "record"    
##  [6,] "number"    "total"    "elect"     "dem"       "law"       
##  [7,] "dem"       "watch"    "report"    "dollar"    "big"       
##  [8,] "start"     "media"    "total"     "american"  "congress"  
##  [9,] "parti"     "militari" "work"      "world"     "dont"      
## [10,] "high"      "today"    "watch"     "campaign"  "talk"      
## [11,] "countri"   "world"    "stop"      "hard"      "republican"
## [12,] "today"     "hunt"     "end"       "strong"    "border"    
## [13,] "fbi"       "corrupt"  "show"      "show"      "corrupt"   
## [14,] "call"      "fbi"      "law"       "vote"      "import"    
## [15,] "histori"   "hillari"  "state"     "year"      "american"  
## [16,] "stop"      "time"     "big"       "collus"    "china"     
## [17,] "thing"     "china"    "hunt"      "happen"    "dem"       
## [18,] "administr" "strong"   "made"      "honor"     "impeach"   
## [19,] "back"      "talk"     "rate"      "meet"      "militari"  
## [20,] "day"       "vote"     "unit"      "russia"    "obama"     
## [21,] "fake"      "american" "administr" "state"     "senat"     
## [22,] "rate"      "collus"   "america"   "thing"     "stop"      
## [23,] "schiff"    "economi"  "american"  "today"     "strong"    
## [24,] "america"   "end"      "back"      "administr" "unit"      
## [25,] "american"  "number"   "bad"       "america"   "administr" 
## [26,] "collus"    "obama"    "billion"   "back"      "america"   
## [27,] "deal"      "peopl"    "border"    "bad"       "back"      
## [28,] "dont"      "senat"    "call"      "billion"   "bad"       
## [29,] "hillari"   "stop"     "china"     "border"    "billion"   
## [30,] "illeg"     "unit"     "congratul" "call"      "call"      
##       Topic 6     Topic 7    Topic 8      Topic 9     Topic 10    
##  [1,] "back"      "economi"  "hous"       "year"      "meet"      
##  [2,] "great"     "tax"      "senat"      "obama"     "north"     
##  [3,] "day"       "high"     "great"      "administr" "talk"      
##  [4,] "bad"       "number"   "includ"     "american"  "import"    
##  [5,] "includ"    "record"   "record"     "crime"     "continu"   
##  [6,] "happen"    "countri"  "big"        "support"   "end"       
##  [7,] "hous"      "crime"    "end"        "militari"  "long"      
##  [8,] "number"    "parti"    "impeach"    "day"       "show"      
##  [9,] "collus"    "big"      "stop"       "corrupt"   "state"     
## [10,] "america"   "great"    "illeg"      "vote"      "good"      
## [11,] "long"      "dem"      "congress"   "senat"     "call"      
## [12,] "fact"      "america"  "dont"       "economi"   "russia"    
## [13,] "histori"   "militari" "hillari"    "made"      "stop"      
## [14,] "job"       "talk"     "bad"        "north"     "fbi"       
## [15,] "deal"      "vote"     "countri"    "fake"      "happen"    
## [16,] "elect"     "watch"    "hard"       "long"      "hard"      
## [17,] "fbi"       "collus"   "love"       "stori"     "usa"       
## [18,] "impeach"   "day"      "win"        "big"       "love"      
## [19,] "make"      "long"     "parti"      "elect"     "back"      
## [20,] "obama"     "obama"    "republican" "hous"      "border"    
## [21,] "schiff"    "peopl"    "support"    "america"   "dont"      
## [22,] "strong"    "state"    "year"       "back"      "hunt"      
## [23,] "wall"      "hard"     "administr"  "bad"       "republican"
## [24,] "watch"     "histori"  "america"    "billion"   "support"   
## [25,] "administr" "illeg"    "american"   "border"    "administr" 
## [26,] "american"  "news"     "back"       "call"      "america"   
## [27,] "big"       "report"   "billion"    "campaign"  "american"  
## [28,] "billion"   "today"    "border"     "china"     "bad"       
## [29,] "border"    "usa"      "call"       "collus"    "big"       
## [30,] "call"      "wall"     "campaign"   "congratul" "billion"   
##       Topic 11    Topic 12    Topic 13     Topic 14    Topic 15   
##  [1,] "state"     "crime"     "american"   "time"      "stori"    
##  [2,] "unit"      "strong"    "peopl"      "long"      "media"    
##  [3,] "dont"      "militari"  "histori"    "start"     "news"     
##  [4,] "continu"   "border"    "made"       "countri"   "corrupt"  
##  [5,] "talk"      "love"      "fact"       "year"      "fake"     
##  [6,] "watch"     "peopl"     "great"      "bad"       "fact"     
##  [7,] "news"      "tax"       "day"        "deal"      "bad"      
##  [8,] "great"     "long"      "dem"        "work"      "state"    
##  [9,] "nation"    "total"     "report"     "make"      "report"   
## [10,] "congress"  "world"     "stop"       "record"    "total"    
## [11,] "obama"     "start"     "includ"     "trade"     "time"     
## [12,] "strong"    "wall"      "happen"     "includ"    "job"      
## [13,] "american"  "countri"   "work"       "secur"     "rate"     
## [14,] "day"       "elect"     "china"      "end"       "big"      
## [15,] "long"      "good"      "dont"       "campaign"  "dont"     
## [16,] "mueller"   "win"       "good"       "stop"      "american" 
## [17,] "start"     "american"  "hous"       "high"      "continu"  
## [18,] "support"   "call"      "news"       "total"     "countri"  
## [19,] "administr" "dollar"    "republican" "russia"    "end"      
## [20,] "america"   "economi"   "schiff"     "congress"  "fbi"      
## [21,] "back"      "fact"      "show"       "histori"   "law"      
## [22,] "bad"       "fake"      "strong"     "import"    "made"     
## [23,] "big"       "high"      "today"      "media"     "mueller"  
## [24,] "billion"   "honor"     "unit"       "rate"      "record"   
## [25,] "border"    "hunt"      "administr"  "usa"       "start"    
## [26,] "call"      "made"      "america"    "congratul" "stop"     
## [27,] "campaign"  "administr" "back"       "crime"     "administr"
## [28,] "china"     "america"   "bad"        "day"       "america"  
## [29,] "collus"    "back"      "big"        "meet"      "back"     
## [30,] "congratul" "bad"       "billion"    "north"     "billion"  
##       Topic 16    Topic 17    Topic 18     Topic 19    Topic 20   
##  [1,] "great"     "good"      "rate"       "job"       "democrat" 
##  [2,] "total"     "thing"     "elect"      "great"     "impeach"  
##  [3,] "support"   "happen"    "bad"        "world"     "dem"      
##  [4,] "love"      "dem"       "show"       "back"      "schiff"   
##  [5,] "show"      "usa"       "watch"      "deal"      "fact"     
##  [6,] "strong"    "north"     "great"      "campaign"  "report"   
##  [7,] "big"       "time"      "call"       "thing"     "world"    
##  [8,] "fbi"       "world"     "dont"       "high"      "start"    
##  [9,] "win"       "great"     "end"        "senat"     "end"      
## [10,] "corrupt"   "work"      "make"       "dollar"    "crime"    
## [11,] "histori"   "big"       "militari"   "happen"    "hard"     
## [12,] "work"      "obama"     "time"       "obama"     "administr"
## [13,] "meet"      "administr" "year"       "america"   "dollar"   
## [14,] "happen"    "includ"    "collus"     "big"       "love"     
## [15,] "make"      "meet"      "republican" "call"      "meet"     
## [16,] "today"     "talk"      "talk"       "economi"   "stop"     
## [17,] "american"  "fact"      "long"       "includ"    "congress" 
## [18,] "border"    "china"     "nation"     "long"      "militari" 
## [19,] "call"      "corrupt"   "usa"        "parti"     "number"   
## [20,] "hunt"      "elect"     "crime"      "state"     "witch"    
## [21,] "media"     "state"     "start"      "strong"    "america"  
## [22,] "news"      "strong"    "administr"  "today"     "american" 
## [23,] "number"    "democrat"  "america"    "unit"      "back"     
## [24,] "administr" "economi"   "american"   "administr" "bad"      
## [25,] "america"   "fake"      "back"       "american"  "big"      
## [26,] "back"      "news"      "big"        "bad"       "billion"  
## [27,] "bad"       "report"    "billion"    "billion"   "border"   
## [28,] "billion"   "america"   "border"     "border"    "call"     
## [29,] "campaign"  "american"  "campaign"   "china"     "campaign" 
## [30,] "china"     "back"      "china"      "collus"    "china"    
##       Topic 21    Topic 22    Topic 23     Topic 24     Topic 25    
##  [1,] "call"      "make"      "border"     "vote"       "countri"   
##  [2,] "made"      "america"   "wall"       "republican" "histori"   
##  [3,] "congress"  "deal"      "secur"      "parti"      "world"     
##  [4,] "schiff"    "back"      "vote"       "big"        "usa"       
##  [5,] "end"       "good"      "schiff"     "elect"      "trade"     
##  [6,] "deal"      "includ"    "hard"       "countri"    "make"      
##  [7,] "happen"    "continu"   "world"      "crime"      "democrat"  
##  [8,] "corrupt"   "high"      "deal"       "impeach"    "includ"    
##  [9,] "year"      "start"     "high"       "bad"        "republican"
## [10,] "today"     "bad"       "make"       "import"     "end"       
## [11,] "import"    "dont"      "nation"     "media"      "back"      
## [12,] "high"      "job"       "republican" "russia"     "day"       
## [13,] "support"   "year"      "today"      "trade"      "peopl"     
## [14,] "campaign"  "great"     "administr"  "end"        "thing"     
## [15,] "collus"    "meet"      "america"    "stori"      "long"      
## [16,] "countri"   "show"      "american"   "talk"       "total"     
## [17,] "economi"   "trade"     "back"       "back"       "news"      
## [18,] "illeg"     "unit"      "bad"        "democrat"   "parti"     
## [19,] "immigr"    "administr" "big"        "happen"     "state"     
## [20,] "senat"     "american"  "billion"    "hous"       "american"  
## [21,] "trade"     "big"       "call"       "illeg"      "border"    
## [22,] "watch"     "billion"   "campaign"   "meet"       "congratul" 
## [23,] "administr" "border"    "china"      "administr"  "hous"      
## [24,] "america"   "call"      "collus"     "america"    "import"    
## [25,] "american"  "campaign"  "congratul"  "american"   "meet"      
## [26,] "back"      "china"     "congress"   "billion"    "report"    
## [27,] "bad"       "collus"    "continu"    "border"     "administr" 
## [28,] "big"       "congratul" "corrupt"    "call"       "america"   
## [29,] "billion"   "congress"  "countri"    "campaign"   "bad"       
## [30,] "border"    "corrupt"   "crime"      "china"      "big"       
##       Topic 26    Topic 27     Topic 28     Topic 29    Topic 30   
##  [1,] "china"     "law"        "peopl"      "witch"     "news"     
##  [2,] "trade"     "illeg"      "countri"    "hunt"      "fake"     
##  [3,] "dollar"    "immigr"     "great"      "mueller"   "media"    
##  [4,] "billion"   "stop"       "start"      "report"    "fact"     
##  [5,] "deal"      "democrat"   "day"        "collus"    "militari" 
##  [6,] "continu"   "end"        "american"   "media"     "report"   
##  [7,] "peopl"     "work"       "dem"        "today"     "nation"   
##  [8,] "usa"       "continu"    "end"        "corrupt"   "world"    
##  [9,] "job"       "deal"       "made"       "total"     "strong"   
## [10,] "fact"      "peopl"      "import"     "china"     "wall"     
## [11,] "good"      "total"      "thing"      "support"   "border"   
## [12,] "happen"    "dollar"     "total"      "work"      "call"     
## [13,] "senat"     "campaign"   "watch"      "administr" "dollar"   
## [14,] "stop"      "republican" "stop"       "call"      "hous"     
## [15,] "today"     "corrupt"    "talk"       "crime"     "talk"     
## [16,] "administr" "import"     "administr"  "happen"    "trade"    
## [17,] "america"   "media"      "fact"       "import"    "administr"
## [18,] "american"  "support"    "fbi"        "obama"     "america"  
## [19,] "back"      "american"   "good"       "secur"     "american" 
## [20,] "bad"       "call"       "republican" "america"   "back"     
## [21,] "big"       "china"      "russia"     "american"  "bad"      
## [22,] "border"    "hillari"    "unit"       "back"      "big"      
## [23,] "call"      "report"     "border"     "bad"       "billion"  
## [24,] "campaign"  "secur"      "fake"       "big"       "campaign" 
## [25,] "collus"    "strong"     "news"       "billion"   "china"    
## [26,] "congratul" "trade"      "number"     "border"    "collus"   
## [27,] "congress"  "witch"      "tax"        "campaign"  "congratul"
## [28,] "corrupt"   "administr"  "witch"      "congratul" "congress" 
## [29,] "countri"   "america"    "work"       "congress"  "continu"  
## [30,] "crime"     "back"       "america"    "continu"   "corrupt"

Per-Document Classification

# Per-topic-per-word probabilities.
tweetsLDA.topicword.prob <- tidy(tweetsLDA, matrix="beta")
head(tweetsLDA.topicword.prob)
## # A tibble: 6 x 3
##   topic term           beta
##   <int> <chr>         <dbl>
## 1     1 administr 0.00184  
## 2     2 administr 0.0000909
## 3     3 administr 0.0000798
## 4     4 administr 0.0000814
## 5     5 administr 0.0000876
## 6     6 administr 0.0000938
# Find the 10 terms that are most common within each topic.
tweetsLDA.topterms <- tweetsLDA.topicword.prob %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

head(tweetsLDA.topterms)
## # A tibble: 6 x 3
##   topic term      beta
##   <int> <chr>    <dbl>
## 1     1 work    0.447 
## 2     1 hard    0.182 
## 3     1 great   0.0780
## 4     1 peopl   0.0622
## 5     1 continu 0.0509
## 6     1 number  0.0465
# Plot per-topic-per-word probabilities for topic #26.
tweetsLDA.topterms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  filter(topic==26) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()

# Classify the selected topic per document.
tweetsLDA.class <- data.frame(topics(tweetsLDA))
tweetsLDA.class <- cbind(tweetsLDA.class, 1:nrow(tweetsLDA.class))
colnames(tweetsLDA.class)[ncol(tweetsLDA.class)] <-'doc_id'
tweetsLDA.class <- tweetsLDA.class %>% filter(topics.tweetsLDA.==26)

head(tweetsLDA.class)
##   topics.tweetsLDA. doc_id
## 1                26     61
## 2                26     62
## 3                26    512
## 4                26    719
## 5                26    786
## 6                26    865
# Inner join selected classified topic with original dataframe.
tweets.final <- inner_join(tweetsLDA.class, tweets_slc)
## Joining, by = "doc_id"
head(tweets.final)
##   topics.tweetsLDA. doc_id
## 1                26     61
## 2                26     62
## 3                26    512
## 4                26    719
## 5                26    786
## 6                26    865
##                                                                                                                                                                                                                                                                         text
## 1                                                                                                                                                                                i am struck by schiffs attempt to characterize  s conversation with  zelensky as a demand i
## 2                                                                                                                                                     we all have the transcript of the call schiff is asking vindman to describe for us we can all read it because the whit
## 3                                                                                                                                                          there is serious work to get done on behalf of this countryand congressional democrats are blatantly ignoring it 
## 4 general michael flynns attorney is demanding that charges be immediately dropped after they found that fbi agents manipulated records against him they say that james clapper told a reporter to take a kill shot at flynn this has been a complete setup of michael flynn
## 5                                                                                                                                                    democrats are trying to deny republican members of congress access to schiffs secret impeachment proceedings what are t
## 6                  doral in miami would have been the best place to hold the gand free but too much heat from the do nothing radical left democrats their partner the fake news media im surprised that they allow me to give up my plus ial salary well find someplace else
##         date hour
## 1 11/19/2019   18
## 2 11/19/2019   18
## 3  11/2/2019   21
## 4 10/26/2019   11
## 5 10/23/2019   17
## 6 10/21/2019   13

Based on the per-topic-per-word probabilities, we can see that “china”, “trade”, “dollar”, “billion”, and “deal” have the highest probabilities in the topic we chose.
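
As an additional check on this choice of topic, one could also inspect the per-document-per-topic probabilities (the gamma matrix); a possible sketch, reusing the fitted tweetsLDA model from above, is shown below.

# Per-document-per-topic probabilities.
tweetsLDA.doc.prob <- tidy(tweetsLDA, matrix="gamma")

# Documents most strongly associated with the selected topic (topic 26).
tweetsLDA.doc.prob %>%
  filter(topic==26) %>%
  arrange(desc(gamma)) %>%
  head()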

Sentiment Analysis

# Turn tweets text into vector.
tweets.df <- as.vector(tweets.final$text)

# Getting emotion score for each tweet.
tweets.emotion <- get_nrc_sentiment(tweets.df)
tweets.emotion <- cbind(tweets.final, tweets.emotion) 
head(tweets.emotion)
##   topics.tweetsLDA. doc_id
## 1                26     61
## 2                26     62
## 3                26    512
## 4                26    719
## 5                26    786
## 6                26    865
##                                                                                                                                                                                                                                                                         text
## 1                                                                                                                                                                                i am struck by schiffs attempt to characterize  s conversation with  zelensky as a demand i
## 2                                                                                                                                                     we all have the transcript of the call schiff is asking vindman to describe for us we can all read it because the whit
## 3                                                                                                                                                          there is serious work to get done on behalf of this countryand congressional democrats are blatantly ignoring it 
## 4 general michael flynns attorney is demanding that charges be immediately dropped after they found that fbi agents manipulated records against him they say that james clapper told a reporter to take a kill shot at flynn this has been a complete setup of michael flynn
## 5                                                                                                                                                    democrats are trying to deny republican members of congress access to schiffs secret impeachment proceedings what are t
## 6                  doral in miami would have been the best place to hold the gand free but too much heat from the do nothing radical left democrats their partner the fake news media im surprised that they allow me to give up my plus ial salary well find someplace else
##         date hour anger anticipation disgust fear joy sadness surprise
## 1 11/19/2019   18     1            1       0    0   0       0        0
## 2 11/19/2019   18     0            0       0    0   0       0        0
## 3  11/2/2019   21     0            0       0    0   0       0        0
## 4 10/26/2019   11     2            1       0    3   1       2        1
## 5 10/23/2019   17     1            0       1    0   0       0        0
## 6 10/21/2019   13     0            1       0    0   1       0        1
##   trust negative positive
## 1     0        1        0
## 2     1        0        0
## 3     0        0        0
## 4     4        4        5
## 5     2        2        0
## 6     1        1        2
# Getting sentiment score for each tweet.
tweets.score <- get_sentiment(tweets.df)
tweets.score <- cbind(tweets.final,tweets.score )
head(tweets.score)
##   topics.tweetsLDA. doc_id
## 1                26     61
## 2                26     62
## 3                26    512
## 4                26    719
## 5                26    786
## 6                26    865
##                                                                                                                                                                                                                                                                         text
## 1                                                                                                                                                                                i am struck by schiffs attempt to characterize  s conversation with  zelensky as a demand i
## 2                                                                                                                                                     we all have the transcript of the call schiff is asking vindman to describe for us we can all read it because the whit
## 3                                                                                                                                                          there is serious work to get done on behalf of this countryand congressional democrats are blatantly ignoring it 
## 4 general michael flynns attorney is demanding that charges be immediately dropped after they found that fbi agents manipulated records against him they say that james clapper told a reporter to take a kill shot at flynn this has been a complete setup of michael flynn
## 5                                                                                                                                                    democrats are trying to deny republican members of congress access to schiffs secret impeachment proceedings what are t
## 6                  doral in miami would have been the best place to hold the gand free but too much heat from the do nothing radical left democrats their partner the fake news media im surprised that they allow me to give up my plus ial salary well find someplace else
##         date hour tweets.score
## 1 11/19/2019   18        -1.00
## 2 11/19/2019   18         0.00
## 3  11/2/2019   21         0.00
## 4 10/26/2019   11        -1.50
## 5 10/23/2019   17        -1.55
## 6 10/21/2019   13         2.10

We have defined the topics in sets of documents using LDA, and we have assigned each tweet a score with our sentiment analysis. Our next step is to map the sentiment scores against the stock market change.

Sentiment Scores vs Stock Price Change

In order to map the sentiment scores, we need to merge the stocks and scores data frames together.

# Update column name.
colnames(tweets.score)[4]<-"Date"

# Aggregate scores into single day.
tweets.score.sum <- tweets.score %>% 
  select(Date, tweets.score) %>% 
  group_by(Date) %>%
  summarise(scores=sum(tweets.score))

# Update date column into date format.
tweets.score.sum$Date <- anydate(tweets.score.sum$Date)
  
# Merge stocks dataframe and scores dataframe.
stocks.df.new <-  stocks.df %>% select(Date, Pct_Change)
stocks.scores <- merge(stocks.df.new,tweets.score.sum, by='Date')

head(stocks.scores)
##         Date  Pct_Change scores
## 1 2018-05-25 -0.08334630  -1.75
## 2 2018-05-29 -0.56374785   1.30
## 3 2018-06-05  0.01237377  -1.40
## 4 2018-06-06  0.69372916   0.65
## 5 2018-06-25 -0.94314398   7.95
## 6 2018-06-26  0.03452978   0.95

When we look at our combined stocks and scores data frame, we are able to see the percentage change of the stock market for a given date and its sentiment score.

## Compare stocks price percentage change with sentiment score.

# Two variables on same y-axis.
ggplot(stocks.scores, aes(Date)) + ggtitle("Stocks Price Change vs Sentiment Scores") + ylab("") +  geom_line(aes(y=Pct_Change, group=1, colour="Stock Price Change")) + geom_line(aes(y=scores, group=2, colour="Sentiment Scores")) + theme(plot.title = element_text(hjust=0.5), axis.title.x=element_blank(), axis.text.x=element_text(angle=90,hjust=1), legend.position=c(0.5,0.9),legend.title=element_blank())

# Each variable on different y-axis with geom_line.
ggplot(stocks.scores,aes(x=Date)) +  geom_line(aes(y=scores, colour="Sentiment Scores")) + geom_line(aes(y=Pct_Change*10, colour="Stock Price Change")) + scale_y_continuous(sec.axis = sec_axis(~ ./100 , name = "%")) + scale_colour_manual(values=c("blue","red")) + labs(y="Scores", x="Date", colour="Parameter") + theme(legend.position=c(0.87,0.885))

# Each variable on different y-axis with geom_line and geom_smooth.
ggplot(stocks.scores,aes(x=Date)) +  geom_line(aes(y=scores, colour="Sentiment Scores")) + geom_smooth(aes(y=Pct_Change*10, colour="Stock Price Change")) + scale_y_continuous(sec.axis = sec_axis(~ ./100 , name = "%")) + scale_colour_manual(values=c("blue","red")) + labs(y="Scores", x="Date", colour="Parameter") + theme(legend.position=c(0.87,0.885))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Linear Model Regression. Checking to see any linear relationship between sentiment scores and stock price change.
stocks.scores.lm <- lm(Pct_Change~scores, data=stocks.scores)
summary(stocks.scores.lm)
## 
## Call:
## lm(formula = Pct_Change ~ scores, data = stocks.scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9426 -0.3251  0.1289  0.4568  1.9250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.002823   0.083193  -0.034    0.973
## scores      -0.029931   0.034463  -0.869    0.387
## 
## Residual standard error: 0.8041 on 105 degrees of freedom
## Multiple R-squared:  0.007133,   Adjusted R-squared:  -0.002323 
## F-statistic: 0.7543 on 1 and 105 DF,  p-value: 0.3871
plot(x = stocks.scores$scores, y = stocks.scores$Pct_Change)
abline(stocks.scores.lm)

theme_set(theme_bw())
ggplot(stocks.scores, aes(scores, Pct_Change))+
  geom_smooth(method = "lm", se=F)

Conclusion

  • Top 5 words used within the topic that has the most impact on stock market price change are “china”, “trade”, “dollar”, “billion”, and “deal”.

  • When the word “trade” is used in a tweet, it is common that words such as “deal”, “billion”, “china”, “countri”, “dollar”, “year”, “unit”, “talk”, “good”, “usa”, “long”, and “meet” are used as well.

  • There is a slight negative linear correlation between the sentiment scores and the percentage change (Pct_Change) in the stock market.

  • Overall, the sentiment scores fluctuate more than the percentage change in the stock market (a quick check of this is sketched below).
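
A quick way to check the last point is to compare the spread of the two series in the stocks.scores data frame built above; a simple sketch:

# Standard deviation of daily sentiment scores vs. daily stock price percentage change.
sd(stocks.scores$scores)
sd(stocks.scores$Pct_Change)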

Appendix

In this section, we included additional approaches we have executed along the way. You might consider these as different iterations of the project/output.

ITERATION 2

ITERATION 1

References