The project aims to model financial sentiment, using sentiment analysis and text mining, to predict stock price fluctuations. Leveraging sentiment analysis, the model systematically processes the valence of words in ten companies' annual reports (Form 10-K filings) across three sections: Risk Factors, Management's Discussion, and Auditor's Comments.
Packages Used:
library(readxl)    # read Excel files
library(tidyverse) # data cleaning
library(tidytext)  # text mining
library(textstem)  # word stemming
library(caTools)   # train/test split
library(rpart); library(rpart.plot) # decision trees
## [1] "AMZN" "HD" "MCD" "NKE" "SBUX" "LOW" "TJX" "BKNG" "TGT" "F"
## Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 7 variables:
## $ TICKER : chr "AMZN" "HD" "MCD" "NKE" ...
## $ RISK FACTORS : chr "Please carefully consider the following discussion of significant factors, events, and uncertainties that make "| __truncated__ "Our business, results of operations, and financial condition are subject to numerous risks and uncertainties. I"| __truncated__ "If we do not successfully evolve and execute against our business strategies, including under the Velocity Grow"| __truncated__ "Special Note Regarding Forward-Looking Statements and Analyst Reports\r\nCertain written and oral statements, o"| __truncated__ ...
## $ MANAGEMENT'S DISCUSSION: chr "Forward-Looking Statements\r\nThis Annual Report on Form 10-K includes forward-looking statements within the me"| __truncated__ "We reported net sales of $108.2 billion in fiscal 2018 . Net earnings were $11.1 billion , or $9.73 per diluted"| __truncated__ "MANAGEMENT'S VIEW OF THE BUSINESS\r\nIn analyzing business trends, management reviews results on a constant cur"| __truncated__ "NIKE designs, develops, markets and sells athletic footwear, apparel, equipment, accessories and services world"| __truncated__ ...
## $ AUDITORS COMMENTS : chr "Opinion on the Financial Statements\r\nWe have audited the accompanying consolidated balance sheets of Amazon.c"| __truncated__ "Opinion on the Consolidated Financial Statements\r\nWe have audited the accompanying Consolidated Balance Sheet"| __truncated__ "Opinion on Internal Control over Financial Reporting\r\n\r\nWe have audited McDonald’s Corporation’s internal c"| __truncated__ "Opinions on the Financial Statements and Internal Control over Financial Reporting\r\nWe have audited the accom"| __truncated__ ...
## $ STOCK PRICE PRE : num 1870.7 191.9 210.1 86.7 84.2 ...
## $ STOCK PRICE POST : num 2008.7 195.6 201 87.3 84 ...
## $ Movement : Factor w/ 2 levels "decrease","increase": 2 2 1 2 1 2 2 1 1 1
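The chunk that builds `project` is not shown in the report; below is a hedged sketch of how such a data frame could be assembled, assuming the report sections were compiled into an Excel workbook named project.xlsx (a hypothetical file name).
# Hypothetical loading step (the original chunk is not shown in the report)
project <- read_excel("project.xlsx")  # assumed workbook with one row per company
# Movement flags whether the stock rose or fell across the report window;
# factor() orders the levels alphabetically: "decrease", "increase"
project$Movement <- factor(ifelse(project$`STOCK PRICE POST` > project$`STOCK PRICE PRE`,
                                  "increase", "decrease"))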
# Lowercase each report section, then stem every word with textstem
project$`RISK FACTORS` <- tolower(project$`RISK FACTORS`)
project$`MANAGEMENT'S DISCUSSION` <- tolower(project$`MANAGEMENT'S DISCUSSION`)
project$`AUDITORS COMMENTS` <- tolower(project$`AUDITORS COMMENTS`)
project$AC_stem <- project$`AUDITORS COMMENTS` %>% stem_strings()
project$MD_stem <- project$`MANAGEMENT'S DISCUSSION` %>% stem_strings()
project$RF_stem <- project$`RISK FACTORS` %>% stem_strings()
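For intuition, stem_strings() collapses inflected forms of a word to a shared stem, so related forms are counted together downstream; a quick illustrative call:
# Illustrative only: "increasing" and "increased" both reduce to the
# stem "increas", so they contribute to the same word count later on
stem_strings("increasing increased increase")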
This section of code builds the per-section data frames for analysis using the Loughran-McDonald financial sentiment lexicon. Its six sentiment categories are Constraining, Litigious, Positive, Negative, Superfluous, and Uncertainty.
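A quick look at the lexicon itself confirms these categories (a sketch; the exact word counts depend on the installed tidytext/textdata version):
# Tally the Loughran lexicon by sentiment category
get_sentiments("loughran") %>% count(sentiment, sort = TRUE)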
RFstem_loughran <- project %>%
  unnest_tokens(word, RF_stem) %>%
  anti_join(stop_words) %>%
  count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
  ungroup() %>%
  inner_join(get_sentiments("loughran"), by = "word")
MDstem_loughran <- project %>%
  unnest_tokens(word, MD_stem) %>%
  anti_join(stop_words) %>%
  count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
  ungroup() %>%
  inner_join(get_sentiments("loughran"), by = "word")
ACstem_loughran <- project %>%
  unnest_tokens(word, AC_stem) %>%
  anti_join(stop_words) %>%
  count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
  ungroup() %>%
  inner_join(get_sentiments("loughran"), by = "word")
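The three pipelines differ only in the column being tokenized, so they could be factored into one helper; a sketch, assuming a tidytext version that supports tidy evaluation of column names (the name tokenize_loughran is ours):
# Sketch of a helper that removes the repetition above; section_col is the
# stemmed column name passed as a string, e.g. "RF_stem"
tokenize_loughran <- function(df, section_col) {
  df %>%
    unnest_tokens(word, !!rlang::sym(section_col)) %>%
    anti_join(stop_words, by = "word") %>%
    count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
    inner_join(get_sentiments("loughran"), by = "word")
}
# e.g. RFstem_loughran <- tokenize_loughran(project, "RF_stem")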
Risk Factors:
## # A tibble: 466 x 7
## TICKER `STOCK PRICE PRE` `STOCK PRICE PO… Movement word n sentiment
## <chr> <dbl> <dbl> <fct> <chr> <int> <chr>
## 1 F 8.31 8.25 decrease risk 35 uncertai…
## 2 AMZN 1871. 2009. increase risk 31 uncertai…
## 3 NKE 86.7 87.3 increase risk 24 uncertai…
## 4 LOW 109. 110. increase risk 23 uncertai…
## 5 MCD 210. 201 decrease succ… 23 positive
## 6 TJX 53.2 54.1 increase risk 22 uncertai…
## 7 MCD 210. 201 decrease depe… 21 uncertai…
## 8 MCD 210. 201 decrease depe… 21 constrai…
## 9 HD 192. 196. increase risk 18 uncertai…
## 10 MCD 210. 201 decrease risk 18 uncertai…
## # … with 456 more rows
Management’s Discussion:
## # A tibble: 114 x 7
## TICKER `STOCK PRICE PRE` `STOCK PRICE PO… Movement word n sentiment
## <chr> <dbl> <dbl> <fct> <chr> <int> <chr>
## 1 BKNG 1678. 1660. decrease risk 11 uncertai…
## 2 HD 192. 196. increase bene… 10 positive
## 3 AMZN 1871. 2009. increase risk 7 uncertai…
## 4 BKNG 1678. 1660. decrease canc… 7 negative
## 5 TJX 53.2 54.1 increase bene… 7 positive
## 6 BKNG 1678. 1660. decrease slow 6 negative
## 7 F 8.31 8.25 decrease loss 6 negative
## 8 HD 192. 196. increase loss 5 negative
## 9 BKNG 1678. 1660. decrease bene… 4 positive
## 10 BKNG 1678. 1660. decrease conc… 4 negative
## # … with 104 more rows
Auditor’s Comments:
## # A tibble: 76 x 7
## TICKER `STOCK PRICE PRE` `STOCK PRICE PO… Movement word n sentiment
## <chr> <dbl> <dbl> <fct> <chr> <int> <chr>
## 1 F 8.31 8.25 decrease bene… 25 positive
## 2 F 8.31 8.25 decrease loss 21 negative
## 3 F 8.31 8.25 decrease risk 5 uncertai…
## 4 NKE 86.7 87.3 increase risk 5 uncertai…
## 5 TJX 53.2 54.1 increase risk 5 uncertai…
## 6 AMZN 1871. 2009. increase bene… 4 positive
## 7 F 8.31 8.25 decrease claim 4 litigious
## 8 F 8.31 8.25 decrease sever 4 negative
## 9 AMZN 1871. 2009. increase law 3 litigious
## 10 BKNG 1678. 1660. decrease risk 3 uncertai…
## # … with 66 more rows
allstem <- bind_rows(ACstem_loughran, MDstem_loughran, RFstem_loughran)
allstem <- allstem %>%
  group_by(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, sentiment, word) %>%
  summarise(word_count = sum(n))
allstem
## # A tibble: 519 x 7
## # Groups:   TICKER, STOCK PRICE PRE, STOCK PRICE POST, Movement, sentiment [51]
## TICKER `STOCK PRICE PR… `STOCK PRICE PO… Movement sentiment word
## <chr> <dbl> <dbl> <fct> <chr> <chr>
## 1 AMZN 1871. 2009. increase constrai… comm…
## 2 AMZN 1871. 2009. increase constrai… depe…
## 3 AMZN 1871. 2009. increase constrai… impa…
## 4 AMZN 1871. 2009. increase constrai… limit
## 5 AMZN 1871. 2009. increase constrai… prev…
## 6 AMZN 1871. 2009. increase constrai… proh…
## 7 AMZN 1871. 2009. increase constrai… rest…
## 8 AMZN 1871. 2009. increase litigious amend
## 9 AMZN 1871. 2009. increase litigious brea…
## 10 AMZN 1871. 2009. increase litigious claim
## # … with 509 more rows, and 1 more variable: word_count <int>
set.seed(617)
# Stratified 80/20 split on the outcome via caTools' sample.split
split = sample.split(Y = allstem$Movement, SplitRatio = 0.8)
train = allstem[split,]
test = allstem[!split,]
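Because sample.split stratifies on Movement, an optional sanity check is to compare the class proportions across the two splits:
# Class balance should be nearly identical in both splits
prop.table(table(train$Movement))
prop.table(table(test$Movement))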
# Collect the words that appear in the test split but not the train split;
# model2 below uses word as a predictor, so unseen words would break predict()
trainwords = unique(train$word)
testwords = unique(test$word)
wwtest = c()
for (i in testwords){
  if (!(i %in% trainwords)){
    wwtest = cbind(wwtest, i)
  }
}
wwtest
## i i i i i i i
## [1,] "break" "spam" "complaint" "constrain" "entail" "therefrom" "default"
## i i i i i i
## [1,] "misconduct" "stringent" "reinterpret" "thereof" "poor" "wrong"
# Hand-typed list of the test-only words found above; drop them from the test set
testunique = c("break","spam","complaint","constrain","entail","therefrom","default","misconduct","stringent","reinterpret","thereof","poor","wrong")
testsubset = subset(test, !(word %in% testunique))
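The loop and the hand-typed testunique vector above could be collapsed into one step with setdiff(); a compact equivalent, as a sketch:
# Equivalent, more compact construction of the filtered test set
wwtest_alt <- setdiff(unique(test$word), unique(train$word))
testsubset <- subset(test, !(word %in% wwtest_alt))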
model1 = glm(Movement~sentiment+word_count,data=train,family='binomial')
summary(model1)
##
## Call:
## glm(formula = Movement ~ sentiment + word_count, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.197 -1.186 -1.064 1.164 1.311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.020833 0.291952 -0.071 0.943
## sentimentlitigious -0.211310 0.387285 -0.546 0.585
## sentimentnegative 0.070159 0.313043 0.224 0.823
## sentimentpositive -0.092569 0.381448 -0.243 0.808
## sentimentsuperfluous -13.540767 535.411245 -0.025 0.980
## sentimentuncertainty 0.067989 0.469148 0.145 0.885
## word_count -0.004466 0.019989 -0.223 0.823
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 575.12 on 414 degrees of freedom
## Residual deviance: 572.65 on 408 degrees of freedom
## AIC: 586.65
##
## Number of Fisher Scoring iterations: 12
pred1 = predict(model1,type="response")
ct1 = table(train$Movement,pred1>0.5); ct1
##
## FALSE TRUE
## decrease 99 113
## increase 85 118
Accuracy rate of train data: 52.29%
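The accuracy rates quoted throughout can be reproduced directly from the confusion tables; for example, for model1 on the train data:
# Correct predictions sit on the diagonal of the confusion table
sum(diag(ct1)) / sum(ct1)  # 0.5229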
pred1Test = predict(model1,newdata=test,type="response")
ct1Test = table(test$Movement,pred1Test>0.5); ct1Test
##
## FALSE TRUE
## decrease 24 29
## increase 21 30
Accuracy rate of test data: 51.92%
model2 = glm(Movement~sentiment+word+word_count,data=train,family='binomial')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
These warnings signal quasi-complete separation: once each word gets its own dummy variable, many words occur with only one Movement class, so their coefficients diverge (note the e+14/e+15 estimates below) and the fitted probabilities are pushed to 0 or 1.
summary(model2)
##
## Call:
## glm(formula = Movement ~ sentiment + word + word_count, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.897 -1.004 0.000 1.044 8.490
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.274e+14 1.350e+14 1.684e+00 0.0921 .
## sentimentlitigious -1.054e+00 1.679e+00 -6.270e-01 0.5304
## sentimentnegative -2.202e+00 1.078e+00 -2.042e+00 0.0412 *
## sentimentpositive 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## sentimentsuperfluous -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## sentimentuncertainty -2.481e-01 1.056e+00 -2.350e-01 0.8143
## wordaftermath -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordamend -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordappeal -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordbad 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordbarrier -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordbenefit -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordboycott -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordbreach -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordburden -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcancel -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordclaim -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordclawback -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcommit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcompel -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordconcern -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordconflict -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordconsent -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordconstraint -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcontract -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcorrupt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcounterfeit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcourt -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordcrime -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordcurtail 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordcut -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddefect -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddefend -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddenial -2.273e+14 1.350e+14 -1.684e+00 0.0923 .
## worddepend -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddeter 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## worddifficult -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddiminish -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddisproportion 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## worddisrupt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddistract -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddistress -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddivert -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddivest -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddownturn -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddownward -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddrought -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordeasier -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## worderror -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordexploit 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordfail -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordfault 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordflaw -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordforego -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordfraud -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordgain -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordharass -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordharm -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordhinder -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordhonor -2.029e+06 9.491e+07 -2.100e-02 0.9829
## wordhurt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordill -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordimpair -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordindebted -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordinterrupt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlack -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlate 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordlaw -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlawsuit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordleadership -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordlegal -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlimit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlose -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordloss -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlost -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordmarkdown -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordmislabel -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordnonetheless NA NA NA NA
## wordpersist -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordpleas 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordpopular -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordpredict -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprevent -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprogress -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordprohibit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprolong -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprotest -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordreassess 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordredress 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordreferenda -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordreferendum -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordregain -9.007e+15 8.221e+07 -1.096e+08 <2e-16 ***
## wordrestraint 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordrestrict -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordreward -9.007e+15 8.219e+07 -1.096e+08 <2e-16 ***
## wordrisk -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordsettlement -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordsever -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordshut -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordshutdown -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordslow -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordslowdown -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordslower 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordsmooth -9.007e+15 9.491e+07 -9.490e+07 <2e-16 ***
## wordstrain -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordstrength -9.007e+15 9.491e+07 -9.490e+07 <2e-16 ***
## wordstrengthen -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordstricter 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordstrong -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordstronger -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordsucceed -2.920e+05 9.491e+07 -3.000e-03 0.9975
## wordsuccess -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordsudden 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordsuffer -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordsuggest -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordsuperior -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordsuspend -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordthereon -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordthereto 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordthreat 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordthreaten -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordtort -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordturmoil -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worduncertain -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordunforeseen -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordunknown -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordunrest -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordweak -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordweaken -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordweaker -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordwin 1.894e+02 9.491e+07 0.000e+00 1.0000
## wordwinner 1.854e+02 9.491e+07 0.000e+00 1.0000
## wordworsen -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## word_count -1.018e-02 3.187e-02 -3.190e-01 0.7495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 575.12 on 414 degrees of freedom
## Residual deviance: 785.17 on 282 degrees of freedom
## AIC: 1051.2
##
## Number of Fisher Scoring iterations: 25
pred2 = predict(model2,type="response")
ct2 = table(train$Movement,pred2>0.5); ct2
##
## FALSE TRUE
## decrease 144 68
## increase 91 112
Accuracy rate of train data: 61.69%
pred2Test = predict(model2,newdata=testsubset,type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
ct2Test = table(testsubset$Movement,pred2Test>0.5); ct2Test
Accuracy rate of test data: 47.78%
tree1 = rpart(Movement~sentiment+word_count,data=train, method="class")
summary(tree1)
## Call:
## rpart(formula = Movement ~ sentiment + word_count, data = train,
## method = "class")
## n= 415
##
## CP nsplit rel error xerror xstd
## 1 0.01724138 0 1.0000000 1.054187 0.05015160
## 2 0.01000000 2 0.9655172 1.201970 0.04939385
##
## Variable importance
## sentiment word_count
## 59 41
##
## Node number 1: 415 observations, complexity param=0.01724138
## predicted class=decrease expected loss=0.4891566 P(node) =1
## class counts: 212 203
## probabilities: 0.511 0.489
## left son=2 (118 obs) right son=3 (297 obs)
## Primary splits:
## sentiment splits as RLRLLR, improve=0.5277306, (0 missing)
## word_count < 2.5 to the right, improve=0.2685279, (0 missing)
##
## Node number 2: 118 observations
## predicted class=decrease expected loss=0.4491525 P(node) =0.2843373
## class counts: 65 53
## probabilities: 0.551 0.449
##
## Node number 3: 297 observations, complexity param=0.01724138
## predicted class=increase expected loss=0.4949495 P(node) =0.7156627
## class counts: 147 150
## probabilities: 0.495 0.505
## left son=6 (28 obs) right son=7 (269 obs)
## Primary splits:
## word_count < 8.5 to the right, improve=0.36164080, (0 missing)
## sentiment splits as L-R--L, improve=0.02825576, (0 missing)
##
## Node number 6: 28 observations
## predicted class=decrease expected loss=0.4285714 P(node) =0.06746988
## class counts: 16 12
## probabilities: 0.571 0.429
##
## Node number 7: 269 observations
## predicted class=increase expected loss=0.4869888 P(node) =0.6481928
## class counts: 131 138
## probabilities: 0.487 0.513
predTree1 = predict(tree1,type="class")
ct = table(train$Movement,predTree1); ct
## predTree1
## decrease increase
## decrease 81 131
## increase 65 138
Accuracy rate of train data: 52.77%
predTree1_test = predict(tree1,newdata=test,type="class")
ct = table(test$Movement,predTree1_test); ct
## predTree1_test
## decrease increase
## decrease 18 35
## increase 16 35
Accuracy rate of test data: 50.96%
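Since rpart.plot is loaded, the fitted trees can also be inspected visually; a minimal sketch (the arguments are illustrative, not from the report):
# Plot tree1 with split labels at each node, plus per-class probabilities
# and the percentage of observations in each node
rpart.plot(tree1, type = 2, extra = 104)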
tree2 = rpart(Movement~sentiment+word_count+word,data=train,method="class")
summary(tree2)
## Call:
## rpart(formula = Movement ~ sentiment + word_count + word, data = train,
## method = "class")
## n= 415
##
## CP nsplit rel error xerror xstd
## 1 0.29064039 0 1.0000000 1.113300 0.04997641
## 2 0.02216749 1 0.7093596 1.256158 0.04884386
## 3 0.01806240 3 0.6650246 1.266010 0.04872760
## 4 0.01477833 6 0.6108374 1.226601 0.04916246
## 5 0.01231527 8 0.5812808 1.177340 0.04959469
## 6 0.01000000 10 0.5566502 1.162562 0.04970069
##
## Variable importance
## word word_count sentiment
## 81 12 7
##
## Node number 1: 415 observations, complexity param=0.2906404
## predicted class=decrease expected loss=0.4891566 P(node) =1
## class counts: 212 203
## probabilities: 0.511 0.489
## left son=2 (106 obs) right son=3 (309 obs)
## Primary splits:
## word splits as RRRRRLRLRLLRLRLRRLLLRRLLRLRRRRRRRRRRLLRLLLLRRRRLLRLLRLRLLRLRRRLRLLRRRRRLLLRRRRRRRLRRLLLRRLRLRRLRLRLRLRRRRRRRRLRLLRRLLRRLLRRRLRRL, improve=27.3464200, (0 missing)
## sentiment splits as RLRLLR, improve= 0.5277306, (0 missing)
## word_count < 2.5 to the right, improve= 0.2685279, (0 missing)
## Surrogate splits:
## sentiment splits as RLRRLR, agree=0.754, adj=0.038, (0 split)
##
## Node number 2: 106 observations
## predicted class=decrease expected loss=0.1792453 P(node) =0.2554217
## class counts: 87 19
## probabilities: 0.821 0.179
##
## Node number 3: 309 observations, complexity param=0.02216749
## predicted class=increase expected loss=0.4045307 P(node) =0.7445783
## class counts: 125 184
## probabilities: 0.405 0.595
## left son=6 (270 obs) right son=7 (39 obs)
## Primary splits:
## word splits as RRLLR-L-L--L-L-LL---LR--R-LLRLRLLRLL--L----LRLR--L--L-R--L-LLR-L--LLLRL---RLLLLLL-RR---RL-L-LR-L-R-R-LRLLRLRL-L--RR--LL--LLL-RR-, improve=12.814890, (0 missing)
## word_count < 2.5 to the right, improve= 4.789470, (0 missing)
## sentiment splits as LRRL-L, improve= 1.018556, (0 missing)
##
## Node number 6: 270 observations, complexity param=0.02216749
## predicted class=increase expected loss=0.4592593 P(node) =0.6506024
## class counts: 124 146
## probabilities: 0.459 0.541
## left son=12 (133 obs) right son=13 (137 obs)
## Primary splits:
## word splits as --RR--L-R--L-L-RR---L-----LR-R-LR-LL--L----R-R---R--L----L-RL--R--LRR-R----LLRLLL-------R-L-L--R-----R-LL-R-R-L------LL--RRL----, improve=2.9155140, (0 missing)
## word_count < 2.5 to the right, improve=1.4861740, (0 missing)
## sentiment splits as LRRL-L, improve=0.4611373, (0 missing)
## Surrogate splits:
## sentiment splits as LRRL-L, agree=0.615, adj=0.218, (0 split)
## word_count < 5.5 to the right, agree=0.552, adj=0.090, (0 split)
##
## Node number 7: 39 observations
## predicted class=increase expected loss=0.02564103 P(node) =0.0939759
## class counts: 1 38
## probabilities: 0.026 0.974
##
## Node number 12: 133 observations, complexity param=0.0180624
## predicted class=decrease expected loss=0.4661654 P(node) =0.3204819
## class counts: 71 62
## probabilities: 0.534 0.466
## left son=24 (108 obs) right son=25 (25 obs)
## Primary splits:
## word_count < 7.5 to the left, improve=0.54215540, (0 missing)
## word splits as ------R----R-L------R-----R----R--RR--R-------------L----L--L-----L--------RL-RRR---------L-R----------LR-----R------RR----R----, improve=0.40601500, (0 missing)
## sentiment splits as LRRR-L, improve=0.08732329, (0 missing)
## Surrogate splits:
## word splits as ------L----L-L------L-----L----L--RL--L-------------L----L--L-----L--------LL-LLL---------R-L----------LL-----L------LL----L----, agree=0.895, adj=0.44, (0 split)
##
## Node number 13: 137 observations, complexity param=0.01477833
## predicted class=increase expected loss=0.3868613 P(node) =0.3301205
## class counts: 53 84
## probabilities: 0.387 0.613
## left son=26 (13 obs) right son=27 (124 obs)
## Primary splits:
## word_count < 8.5 to the right, improve=1.50014500, (0 missing)
## word splits as --RR----L------LR----------R-L--R----------L-L---L---------L---R---LL-R------L----------L------R-----R----L-L------------RR-----, improve=0.43239150, (0 missing)
## sentiment splits as LRRL-R, improve=0.08460578, (0 missing)
##
## Node number 24: 108 observations, complexity param=0.0180624
## predicted class=decrease expected loss=0.4444444 P(node) =0.260241
## class counts: 60 48
## probabilities: 0.556 0.444
## left son=48 (47 obs) right son=49 (61 obs)
## Primary splits:
## word splits as ------L----R-L------R-----R----R--LR--R-------------R----L--L-----L--------RL-RRR-----------R----------LR-----R------RR----R----, improve=1.1394020, (0 missing)
## word_count < 4.5 to the right, improve=0.8809020, (0 missing)
## sentiment splits as LRRL-R, improve=0.3883024, (0 missing)
## Surrogate splits:
## sentiment splits as LRRR-L, agree=0.713, adj=0.34, (0 split)
## word_count < 1.5 to the right, agree=0.639, adj=0.17, (0 split)
##
## Node number 25: 25 observations, complexity param=0.01477833
## predicted class=increase expected loss=0.44 P(node) =0.06024096
## class counts: 11 14
## probabilities: 0.440 0.560
## left son=50 (11 obs) right son=51 (14 obs)
## Primary splits:
## word splits as ------R-------------------L-------R-----------------L----R--------R-----------------------L-------------------------------------, improve=1.5148050, (0 missing)
## sentiment splits as R-LR-L, improve=1.3338890, (0 missing)
## word_count < 9.5 to the right, improve=0.4628571, (0 missing)
## Surrogate splits:
## sentiment splits as R-RR-L, agree=0.92, adj=0.818, (0 split)
## word_count < 19 to the right, agree=0.80, adj=0.545, (0 split)
##
## Node number 26: 13 observations
## predicted class=decrease expected loss=0.3846154 P(node) =0.0313253
## class counts: 8 5
## probabilities: 0.615 0.385
##
## Node number 27: 124 observations
## predicted class=increase expected loss=0.3629032 P(node) =0.2987952
## class counts: 45 79
## probabilities: 0.363 0.637
##
## Node number 48: 47 observations
## predicted class=decrease expected loss=0.3617021 P(node) =0.113253
## class counts: 30 17
## probabilities: 0.638 0.362
##
## Node number 49: 61 observations, complexity param=0.0180624
## predicted class=increase expected loss=0.4918033 P(node) =0.146988
## class counts: 30 31
## probabilities: 0.492 0.508
## left son=98 (15 obs) right son=99 (46 obs)
## Primary splits:
## word_count < 3.5 to the right, improve=2.320789000, (0 missing)
## word splits as -----------R--------L-----R----L---L--L-------------L----------------------L--LLL-----------L-----------L-----L------LL----L----, improve=0.037257820, (0 missing)
## sentiment splits as LLRL-L, improve=0.006088993, (0 missing)
## Surrogate splits:
## word splits as -----------R--------R-----R----R---R--R-------------L----------------------R--RRR-----------R-----------R-----R------RR----R----, agree=0.82, adj=0.267, (0 split)
##
## Node number 50: 11 observations
## predicted class=decrease expected loss=0.3636364 P(node) =0.02650602
## class counts: 7 4
## probabilities: 0.636 0.364
##
## Node number 51: 14 observations
## predicted class=increase expected loss=0.2857143 P(node) =0.03373494
## class counts: 4 10
## probabilities: 0.286 0.714
##
## Node number 98: 15 observations
## predicted class=decrease expected loss=0.2666667 P(node) =0.03614458
## class counts: 11 4
## probabilities: 0.733 0.267
##
## Node number 99: 46 observations, complexity param=0.01231527
## predicted class=increase expected loss=0.4130435 P(node) =0.1108434
## class counts: 19 27
## probabilities: 0.413 0.587
## left son=198 (29 obs) right son=199 (17 obs)
## Primary splits:
## word_count < 1.5 to the left, improve=1.7039420, (0 missing)
## word splits as -----------L--------L-----R----R---L--L------------------------------------L--RLL-----------R-----------L-----L------LL----R----, improve=1.2041150, (0 missing)
## sentiment splits as LRRL-L, improve=0.1785872, (0 missing)
## Surrogate splits:
## word splits as -----------R--------L-----L----R---L--L------------------------------------L--LLL-----------L-----------L-----L------LL----R----, agree=0.848, adj=0.588, (0 split)
## sentiment splits as LRLL-L, agree=0.739, adj=0.294, (0 split)
##
## Node number 198: 29 observations, complexity param=0.01231527
## predicted class=decrease expected loss=0.4827586 P(node) =0.06987952
## class counts: 15 14
## probabilities: 0.517 0.483
## left son=396 (9 obs) right son=397 (20 obs)
## Primary splits:
## word splits as --------------------L-----R--------R--R------------------------------------L--RRR-----------R-----------L-----R------RL---------, improve=1.7716480, (0 missing)
## sentiment splits as R-RL-L, improve=0.1788371, (0 missing)
## Surrogate splits:
## sentiment splits as R-RR-L, agree=0.793, adj=0.333, (0 split)
##
## Node number 199: 17 observations
## predicted class=increase expected loss=0.2352941 P(node) =0.04096386
## class counts: 4 13
## probabilities: 0.235 0.765
##
## Node number 396: 9 observations
## predicted class=decrease expected loss=0.2222222 P(node) =0.02168675
## class counts: 7 2
## probabilities: 0.778 0.222
##
## Node number 397: 20 observations
## predicted class=increase expected loss=0.4 P(node) =0.04819277
## class counts: 8 12
## probabilities: 0.400 0.600
predTree2 = predict(tree2,type="class")
ct = table(train$Movement,predTree2); ct
## predTree2
## decrease increase
## decrease 150 62
## increase 51 152
Accuracy rate of train data: 72.77%
predTree2_test = predict(tree2,newdata=testsubset,type="class")
ct = table(testsubset$Movement,predTree2_test); ct
## predTree2_test
## decrease increase
## decrease 20 23
## increase 25 22
Accuracy rate of test data: 46.67%
From the research conducted, annual financial reports can be indicative of the direction of stock price fluctuations through key words that delineate certain sentiments. The best-performing model was the decision tree with three predictor variables (sentiment, word, and word count), at a 73% accuracy rate on the train data but only a 47% accuracy rate on the test data; the gap between the two suggests the word-level predictors overfit the small sample. Limitations of the project include the small sample of companies, the volatility of external events, and the uneven distribution of words across the Loughran lexicon's sentiment categories.
Suggested next steps are to increase the sample size by including more companies and to extend the time span by incorporating prior years' annual reports.