The project aims to model financial sentiment, using sentiment analysis and text mining, to predict stock price fluctuations. Leveraging sentiment analysis, the model systematically processes the valence of words in ten companies' annual reports (Form 10-K filings) across three sections: Risk Factors, Management's Discussion, and Auditor's Comments.
Packages Used:
library(readxl)    # read Excel files
library(tidyverse) # data cleaning
library(tidytext)  # text mining
library(textstem)  # word stemming
library(caTools)   # train/test split
library(rpart); library(rpart.plot) # decision trees
## [1] "AMZN" "HD" "MCD" "NKE" "SBUX" "LOW" "TJX" "BKNG" "TGT" "F"
## Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 7 variables:
## $ TICKER : chr "AMZN" "HD" "MCD" "NKE" ...
## $ RISK FACTORS : chr "Please carefully consider the following discussion of significant factors, events, and uncertainties that make "| __truncated__ "Our business, results of operations, and financial condition are subject to numerous risks and uncertainties. I"| __truncated__ "If we do not successfully evolve and execute against our business strategies, including under the Velocity Grow"| __truncated__ "Special Note Regarding Forward-Looking Statements and Analyst Reports\r\nCertain written and oral statements, o"| __truncated__ ...
## $ MANAGEMENT'S DISCUSSION: chr "Forward-Looking Statements\r\nThis Annual Report on Form 10-K includes forward-looking statements within the me"| __truncated__ "We reported net sales of $108.2 billion in fiscal 2018 . Net earnings were $11.1 billion , or $9.73 per diluted"| __truncated__ "MANAGEMENT'S VIEW OF THE BUSINESS\r\nIn analyzing business trends, management reviews results on a constant cur"| __truncated__ "NIKE designs, develops, markets and sells athletic footwear, apparel, equipment, accessories and services world"| __truncated__ ...
## $ AUDITORS COMMENTS : chr "Opinion on the Financial Statements\r\nWe have audited the accompanying consolidated balance sheets of Amazon.c"| __truncated__ "Opinion on the Consolidated Financial Statements\r\nWe have audited the accompanying Consolidated Balance Sheet"| __truncated__ "Opinion on Internal Control over Financial Reporting\r\n\r\nWe have audited McDonald’s Corporation’s internal c"| __truncated__ "Opinions on the Financial Statements and Internal Control over Financial Reporting\r\nWe have audited the accom"| __truncated__ ...
## $ STOCK PRICE PRE : num 1870.7 191.9 210.1 86.7 84.2 ...
## $ STOCK PRICE POST : num 2008.7 195.6 201 87.3 84 ...
## $ Movement : Factor w/ 2 levels "decrease","increase": 2 2 1 2 1 2 2 1 1 1
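The chunk that builds `project` is not shown in the report; below is a hedged sketch of how such a data frame could be assembled, assuming the report sections were compiled into an Excel workbook named project.xlsx (a hypothetical file name).
# Hypothetical loading step (the original chunk is not shown in the report)
project <- read_excel("project.xlsx")  # assumed workbook with one row per company
# Movement flags whether the stock rose or fell across the report window;
# factor() orders the levels alphabetically: "decrease", "increase"
project$Movement <- factor(ifelse(project$`STOCK PRICE POST` > project$`STOCK PRICE PRE`,
                                  "increase", "decrease"))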
# Lowercase each report section, then stem every word with textstem
project$`RISK FACTORS` <- tolower(project$`RISK FACTORS`)
project$`MANAGEMENT'S DISCUSSION` <- tolower(project$`MANAGEMENT'S DISCUSSION`)
project$`AUDITORS COMMENTS` <- tolower(project$`AUDITORS COMMENTS`)
project$AC_stem <- project$`AUDITORS COMMENTS` %>% stem_strings()
project$MD_stem <- project$`MANAGEMENT'S DISCUSSION` %>% stem_strings()
project$RF_stem <- project$`RISK FACTORS` %>% stem_strings()
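For intuition, stem_strings() collapses inflected forms of a word to a shared stem, so related forms are counted together downstream; a quick illustrative call:
# Illustrative only: "increasing" and "increased" both reduce to the
# stem "increas", so they contribute to the same word count later on
stem_strings("increasing increased increase")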
This section of code builds the per-section data frames for analysis using the Loughran-McDonald financial sentiment lexicon. Its six sentiment categories are Constraining, Litigious, Positive, Negative, Superfluous, and Uncertainty.
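A quick look at the lexicon itself confirms these categories (a sketch; the exact word counts depend on the installed tidytext/textdata version):
# Tally the Loughran lexicon by sentiment category
get_sentiments("loughran") %>% count(sentiment, sort = TRUE)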
RFstem_loughran <- project %>%
  unnest_tokens(word, RF_stem) %>%
  anti_join(stop_words) %>%
  count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
  ungroup() %>%
  inner_join(get_sentiments("loughran"), by = "word")
MDstem_loughran <- project %>%
  unnest_tokens(word, MD_stem) %>%
  anti_join(stop_words) %>%
  count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
  ungroup() %>%
  inner_join(get_sentiments("loughran"), by = "word")
ACstem_loughran <- project %>%
  unnest_tokens(word, AC_stem) %>%
  anti_join(stop_words) %>%
  count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
  ungroup() %>%
  inner_join(get_sentiments("loughran"), by = "word")
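The three pipelines differ only in the column being tokenized, so they could be factored into one helper; a sketch, assuming a tidytext version that supports tidy evaluation of column names (the name tokenize_loughran is ours):
# Sketch of a helper that removes the repetition above; section_col is the
# stemmed column name passed as a string, e.g. "RF_stem"
tokenize_loughran <- function(df, section_col) {
  df %>%
    unnest_tokens(word, !!rlang::sym(section_col)) %>%
    anti_join(stop_words, by = "word") %>%
    count(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, word, sort = TRUE) %>%
    inner_join(get_sentiments("loughran"), by = "word")
}
# e.g. RFstem_loughran <- tokenize_loughran(project, "RF_stem")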
Risk Factors:
## # A tibble: 466 x 7
## TICKER `STOCK PRICE PRE` `STOCK PRICE PO… Movement word n sentiment
## <chr> <dbl> <dbl> <fct> <chr> <int> <chr>
## 1 F 8.31 8.25 decrease risk 35 uncertai…
## 2 AMZN 1871. 2009. increase risk 31 uncertai…
## 3 NKE 86.7 87.3 increase risk 24 uncertai…
## 4 LOW 109. 110. increase risk 23 uncertai…
## 5 MCD 210. 201 decrease succ… 23 positive
## 6 TJX 53.2 54.1 increase risk 22 uncertai…
## 7 MCD 210. 201 decrease depe… 21 uncertai…
## 8 MCD 210. 201 decrease depe… 21 constrai…
## 9 HD 192. 196. increase risk 18 uncertai…
## 10 MCD 210. 201 decrease risk 18 uncertai…
## # … with 456 more rows
Management’s Discussion:
## # A tibble: 114 x 7
## TICKER `STOCK PRICE PRE` `STOCK PRICE PO… Movement word n sentiment
## <chr> <dbl> <dbl> <fct> <chr> <int> <chr>
## 1 BKNG 1678. 1660. decrease risk 11 uncertai…
## 2 HD 192. 196. increase bene… 10 positive
## 3 AMZN 1871. 2009. increase risk 7 uncertai…
## 4 BKNG 1678. 1660. decrease canc… 7 negative
## 5 TJX 53.2 54.1 increase bene… 7 positive
## 6 BKNG 1678. 1660. decrease slow 6 negative
## 7 F 8.31 8.25 decrease loss 6 negative
## 8 HD 192. 196. increase loss 5 negative
## 9 BKNG 1678. 1660. decrease bene… 4 positive
## 10 BKNG 1678. 1660. decrease conc… 4 negative
## # … with 104 more rows
Auditor’s Comments:
## # A tibble: 76 x 7
## TICKER `STOCK PRICE PRE` `STOCK PRICE PO… Movement word n sentiment
## <chr> <dbl> <dbl> <fct> <chr> <int> <chr>
## 1 F 8.31 8.25 decrease bene… 25 positive
## 2 F 8.31 8.25 decrease loss 21 negative
## 3 F 8.31 8.25 decrease risk 5 uncertai…
## 4 NKE 86.7 87.3 increase risk 5 uncertai…
## 5 TJX 53.2 54.1 increase risk 5 uncertai…
## 6 AMZN 1871. 2009. increase bene… 4 positive
## 7 F 8.31 8.25 decrease claim 4 litigious
## 8 F 8.31 8.25 decrease sever 4 negative
## 9 AMZN 1871. 2009. increase law 3 litigious
## 10 BKNG 1678. 1660. decrease risk 3 uncertai…
## # … with 66 more rows
allstem <- bind_rows(ACstem_loughran, MDstem_loughran, RFstem_loughran)
allstem <- allstem %>%
  group_by(TICKER, `STOCK PRICE PRE`, `STOCK PRICE POST`, Movement, sentiment, word) %>%
  summarise(word_count = sum(n))
allstem
## # A tibble: 519 x 7
## # Groups:   TICKER, STOCK PRICE PRE, STOCK PRICE POST, Movement, sentiment [51]
## TICKER `STOCK PRICE PR… `STOCK PRICE PO… Movement sentiment word
## <chr> <dbl> <dbl> <fct> <chr> <chr>
## 1 AMZN 1871. 2009. increase constrai… comm…
## 2 AMZN 1871. 2009. increase constrai… depe…
## 3 AMZN 1871. 2009. increase constrai… impa…
## 4 AMZN 1871. 2009. increase constrai… limit
## 5 AMZN 1871. 2009. increase constrai… prev…
## 6 AMZN 1871. 2009. increase constrai… proh…
## 7 AMZN 1871. 2009. increase constrai… rest…
## 8 AMZN 1871. 2009. increase litigious amend
## 9 AMZN 1871. 2009. increase litigious brea…
## 10 AMZN 1871. 2009. increase litigious claim
## # … with 509 more rows, and 1 more variable: word_count <int>
set.seed(617)
# Stratified 80/20 split on the outcome via caTools' sample.split
split = sample.split(Y = allstem$Movement, SplitRatio = 0.8)
train = allstem[split,]
test = allstem[!split,]
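Because sample.split stratifies on Movement, an optional sanity check is to compare the class proportions across the two splits:
# Class balance should be nearly identical in both splits
prop.table(table(train$Movement))
prop.table(table(test$Movement))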
# Collect the words that appear in the test split but not the train split;
# model2 below uses word as a predictor, so unseen words would break predict()
trainwords = unique(train$word)
testwords = unique(test$word)
wwtest = c()
for (i in testwords){
  if (!(i %in% trainwords)){
    wwtest = cbind(wwtest, i)
  }
}
wwtest
## i i i i i i i
## [1,] "break" "spam" "complaint" "constrain" "entail" "therefrom" "default"
## i i i i i i
## [1,] "misconduct" "stringent" "reinterpret" "thereof" "poor" "wrong"
# Hand-typed list of the test-only words found above; drop them from the test set
testunique = c("break","spam","complaint","constrain","entail","therefrom","default","misconduct","stringent","reinterpret","thereof","poor","wrong")
testsubset = subset(test, !(word %in% testunique))
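The loop and the hand-typed testunique vector above could be collapsed into one step with setdiff(); a compact equivalent, as a sketch:
# Equivalent, more compact construction of the filtered test set
wwtest_alt <- setdiff(unique(test$word), unique(train$word))
testsubset <- subset(test, !(word %in% wwtest_alt))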
model1 = glm(Movement~sentiment+word_count,data=train,family='binomial')
summary(model1)
##
## Call:
## glm(formula = Movement ~ sentiment + word_count, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.197 -1.186 -1.064 1.164 1.311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.020833 0.291952 -0.071 0.943
## sentimentlitigious -0.211310 0.387285 -0.546 0.585
## sentimentnegative 0.070159 0.313043 0.224 0.823
## sentimentpositive -0.092569 0.381448 -0.243 0.808
## sentimentsuperfluous -13.540767 535.411245 -0.025 0.980
## sentimentuncertainty 0.067989 0.469148 0.145 0.885
## word_count -0.004466 0.019989 -0.223 0.823
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 575.12 on 414 degrees of freedom
## Residual deviance: 572.65 on 408 degrees of freedom
## AIC: 586.65
##
## Number of Fisher Scoring iterations: 12
pred1 = predict(model1,type="response")
ct1 = table(train$Movement,pred1>0.5); ct1
##
## FALSE TRUE
## decrease 99 113
## increase 85 118
Accuracy rate of train data: 52.29%
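The accuracy rates quoted throughout can be reproduced directly from the confusion tables; for example, for model1 on the train data:
# Correct predictions sit on the diagonal of the confusion table
sum(diag(ct1)) / sum(ct1)  # 0.5229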
pred1Test = predict(model1,newdata=test,type="response")
ct1Test = table(test$Movement,pred1Test>0.5); ct1Test
##
## FALSE TRUE
## decrease 24 29
## increase 21 30
Accuracy rate of test data: 51.92%
model2 = glm(Movement~sentiment+word+word_count,data=train,family='binomial')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
These warnings signal quasi-complete separation: once each word gets its own dummy variable, many words occur with only one Movement class, so their coefficients diverge (note the e+14/e+15 estimates below) and the fitted probabilities are pushed to 0 or 1.
summary(model2)
##
## Call:
## glm(formula = Movement ~ sentiment + word + word_count, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.897 -1.004 0.000 1.044 8.490
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.274e+14 1.350e+14 1.684e+00 0.0921 .
## sentimentlitigious -1.054e+00 1.679e+00 -6.270e-01 0.5304
## sentimentnegative -2.202e+00 1.078e+00 -2.042e+00 0.0412 *
## sentimentpositive 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## sentimentsuperfluous -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## sentimentuncertainty -2.481e-01 1.056e+00 -2.350e-01 0.8143
## wordaftermath -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordamend -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordappeal -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordbad 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordbarrier -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordbenefit -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordboycott -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordbreach -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordburden -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcancel -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordclaim -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordclawback -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcommit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcompel -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordconcern -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordconflict -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordconsent -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordconstraint -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcontract -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcorrupt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcounterfeit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordcourt -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordcrime -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordcurtail 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordcut -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddefect -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddefend -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddenial -2.273e+14 1.350e+14 -1.684e+00 0.0923 .
## worddepend -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddeter 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## worddifficult -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddiminish -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddisproportion 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## worddisrupt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddistract -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddistress -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddivert -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddivest -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worddownturn -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddownward -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## worddrought -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordeasier -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## worderror -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordexploit 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordfail -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordfault 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordflaw -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordforego -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordfraud -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordgain -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordharass -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordharm -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordhinder -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordhonor -2.029e+06 9.491e+07 -2.100e-02 0.9829
## wordhurt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordill -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordimpair -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordindebted -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordinterrupt -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlack -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlate 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordlaw -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlawsuit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordleadership -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordlegal -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlimit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlose -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordloss -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordlost -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordmarkdown -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordmislabel -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordnonetheless NA NA NA NA
## wordpersist -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordpleas 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordpopular -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordpredict -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprevent -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprogress -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordprohibit -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprolong -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordprotest -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordreassess 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordredress 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordreferenda -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordreferendum -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordregain -9.007e+15 8.221e+07 -1.096e+08 <2e-16 ***
## wordrestraint 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordrestrict -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordreward -9.007e+15 8.219e+07 -1.096e+08 <2e-16 ***
## wordrisk -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordsettlement -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordsever -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordshut -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordshutdown -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordslow -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordslowdown -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordslower 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordsmooth -9.007e+15 9.491e+07 -9.490e+07 <2e-16 ***
## wordstrain -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordstrength -9.007e+15 9.491e+07 -9.490e+07 <2e-16 ***
## wordstrengthen -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordstricter 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordstrong -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordstronger -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordsucceed -2.920e+05 9.491e+07 -3.000e-03 0.9975
## wordsuccess -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordsudden 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordsuffer -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordsuggest -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordsuperior -4.504e+15 6.711e+07 -6.711e+07 <2e-16 ***
## wordsuspend -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordthereon -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordthereto 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordthreat 4.276e+15 1.350e+14 3.167e+01 <2e-16 ***
## wordthreaten -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordtort -4.731e+15 1.350e+14 -3.504e+01 <2e-16 ***
## wordturmoil -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## worduncertain -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordunforeseen -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordunknown -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordunrest -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordweak -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordweaken -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordweaker -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## wordwin 1.894e+02 9.491e+07 0.000e+00 1.0000
## wordwinner 1.854e+02 9.491e+07 0.000e+00 1.0000
## wordworsen -2.274e+14 1.350e+14 -1.684e+00 0.0921 .
## word_count -1.018e-02 3.187e-02 -3.190e-01 0.7495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 575.12 on 414 degrees of freedom
## Residual deviance: 785.17 on 282 degrees of freedom
## AIC: 1051.2
##
## Number of Fisher Scoring iterations: 25
pred2 = predict(model2,type="response")
ct2 = table(train$Movement,pred2>0.5); ct2
##
## FALSE TRUE
## decrease 144 68
## increase 91 112
Accuracy rate of train data: 61.69%
pred2Test = predict(model2,newdata=testsubset,type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
ct2Test = table(testsubset$Movement,pred2Test>0.5); ct2Test
Accuracy rate of test data: 47.78%
tree1 = rpart(Movement~sentiment+word_count,data=train, method="class")
summary(tree1)
## Call:
## rpart(formula = Movement ~ sentiment + word_count, data = train,
## method = "class")
## n= 415
##
## CP nsplit rel error xerror xstd
## 1 0.01724138 0 1.0000000 1.054187 0.05015160
## 2 0.01000000 2 0.9655172 1.201970 0.04939385
##
## Variable importance
## sentiment word_count
## 59 41
##
## Node number 1: 415 observations, complexity param=0.01724138
## predicted class=decrease expected loss=0.4891566 P(node) =1
## class counts: 212 203
## probabilities: 0.511 0.489
## left son=2 (118 obs) right son=3 (297 obs)
## Primary splits:
## sentiment splits as RLRLLR, improve=0.5277306, (0 missing)
## word_count < 2.5 to the right, improve=0.2685279, (0 missing)
##
## Node number 2: 118 observations
## predicted class=decrease expected loss=0.4491525 P(node) =0.2843373
## class counts: 65 53
## probabilities: 0.551 0.449
##
## Node number 3: 297 observations, complexity param=0.01724138
## predicted class=increase expected loss=0.4949495 P(node) =0.7156627
## class counts: 147 150
## probabilities: 0.495 0.505
## left son=6 (28 obs) right son=7 (269 obs)
## Primary splits:
## word_count < 8.5 to the right, improve=0.36164080, (0 missing)
## sentiment splits as L-R--L, improve=0.02825576, (0 missing)
##
## Node number 6: 28 observations
## predicted class=decrease expected loss=0.4285714 P(node) =0.06746988
## class counts: 16 12
## probabilities: 0.571 0.429
##
## Node number 7: 269 observations
## predicted class=increase expected loss=0.4869888 P(node) =0.6481928
## class counts: 131 138
## probabilities: 0.487 0.513
predTree1 = predict(tree1,type="class")
ct = table(train$Movement,predTree1); ct
## predTree1
## decrease increase
## decrease 81 131
## increase 65 138
Accuracy rate of train data: 52.77%
predTree1_test = predict(tree1,newdata=test,type="class")
ct = table(test$Movement,predTree1_test); ct
## predTree1_test
## decrease increase
## decrease 18 35
## increase 16 35
Accuracy rate of test data: 50.96%
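Since rpart.plot is loaded, the fitted trees can also be inspected visually; a minimal sketch (the arguments are illustrative, not from the report):
# Plot tree1 with split labels at each node, plus per-class probabilities
# and the percentage of observations in each node
rpart.plot(tree1, type = 2, extra = 104)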
tree2 = rpart(Movement~sentiment+word_count+word,data=train,method="class")
summary(tree2)
## Call:
## rpart(formula = Movement ~ sentiment + word_count + word, data = train,
## method = "class")
## n= 415
##
## CP nsplit rel error xerror xstd
## 1 0.29064039 0 1.0000000 1.113300 0.04997641
## 2 0.02216749 1 0.7093596 1.256158 0.04884386
## 3 0.01806240 3 0.6650246 1.266010 0.04872760
## 4 0.01477833 6 0.6108374 1.226601 0.04916246
## 5 0.01231527 8 0.5812808 1.177340 0.04959469
## 6 0.01000000 10 0.5566502 1.162562 0.04970069
##
## Variable importance
## word word_count sentiment
## 81 12 7
##
## Node number 1: 415 observations, complexity param=0.2906404
## predicted class=decrease expected loss=0.4891566 P(node) =1
## class counts: 212 203
## probabilities: 0.511 0.489
## left son=2 (106 obs) right son=3 (309 obs)
## Primary splits:
## word splits as RRRRRLRLRLLRLRLRRLLLRRLLRLRRRRRRRRRRLLRLLLLRRRRLLRLLRLRLLRLRRRLRLLRRRRRLLLRRRRRRRLRRLLLRRLRLRRLRLRLRLRRRRRRRRLRLLRRLLRRLLRRRLRRL, improve=27.3464200, (0 missing)
## sentiment splits as RLRLLR, improve= 0.5277306, (0 missing)
## word_count < 2.5 to the right, improve= 0.2685279, (0 missing)
## Surrogate splits:
## sentiment splits as RLRRLR, agree=0.754, adj=0.038, (0 split)
##
## Node number 2: 106 observations
## predicted class=decrease expected loss=0.1792453 P(node) =0.2554217
## class counts: 87 19
## probabilities: 0.821 0.179
##
## Node number 3: 309 observations, complexity param=0.02216749
## predicted class=increase expected loss=0.4045307 P(node) =0.7445783
## class counts: 125 184
## probabilities: 0.405 0.595
## left son=6 (270 obs) right son=7 (39 obs)
## Primary splits:
## word splits as RRLLR-L-L--L-L-LL---LR--R-LLRLRLLRLL--L----LRLR--L--L-R--L-LLR-L--LLLRL---RLLLLLL-RR---RL-L-LR-L-R-R-LRLLRLRL-L--RR--LL--LLL-RR-, improve=12.814890, (0 missing)
## word_count < 2.5 to the right, improve= 4.789470, (0 missing)
## sentiment splits as LRRL-L, improve= 1.018556, (0 missing)
##
## Node number 6: 270 observations, complexity param=0.02216749
## predicted class=increase expected loss=0.4592593 P(node) =0.6506024
## class counts: 124 146
## probabilities: 0.459 0.541
## left son=12 (133 obs) right son=13 (137 obs)
## Primary splits:
## word splits as --RR--L-R--L-L-RR---L-----LR-R-LR-LL--L----R-R---R--L----L-RL--R--LRR-R----LLRLLL-------R-L-L--R-----R-LL-R-R-L------LL--RRL----, improve=2.9155140, (0 missing)
## word_count < 2.5 to the right, improve=1.4861740, (0 missing)
## sentiment splits as LRRL-L, improve=0.4611373, (0 missing)
## Surrogate splits:
## sentiment splits as LRRL-L, agree=0.615, adj=0.218, (0 split)
## word_count < 5.5 to the right, agree=0.552, adj=0.090, (0 split)
##
## Node number 7: 39 observations
## predicted class=increase expected loss=0.02564103 P(node) =0.0939759
## class counts: 1 38
## probabilities: 0.026 0.974
##
## Node number 12: 133 observations, complexity param=0.0180624
## predicted class=decrease expected loss=0.4661654 P(node) =0.3204819
## class counts: 71 62
## probabilities: 0.534 0.466
## left son=24 (108 obs) right son=25 (25 obs)
## Primary splits:
## word_count < 7.5 to the left, improve=0.54215540, (0 missing)
## word splits as ------R----R-L------R-----R----R--RR--R-------------L----L--L-----L--------RL-RRR---------L-R----------LR-----R------RR----R----, improve=0.40601500, (0 missing)
## sentiment splits as LRRR-L, improve=0.08732329, (0 missing)
## Surrogate splits:
## word splits as ------L----L-L------L-----L----L--RL--L-------------L----L--L-----L--------LL-LLL---------R-L----------LL-----L------LL----L----, agree=0.895, adj=0.44, (0 split)
##
## Node number 13: 137 observations, complexity param=0.01477833
## predicted class=increase expected loss=0.3868613 P(node) =0.3301205
## class counts: 53 84
## probabilities: 0.387 0.613
## left son=26 (13 obs) right son=27 (124 obs)
## Primary splits:
## word_count < 8.5 to the right, improve=1.50014500, (0 missing)
## word splits as --RR----L------LR----------R-L--R----------L-L---L---------L---R---LL-R------L----------L------R-----R----L-L------------RR-----, improve=0.43239150, (0 missing)
## sentiment splits as LRRL-R, improve=0.08460578, (0 missing)
##
## Node number 24: 108 observations, complexity param=0.0180624
## predicted class=decrease expected loss=0.4444444 P(node) =0.260241
## class counts: 60 48
## probabilities: 0.556 0.444
## left son=48 (47 obs) right son=49 (61 obs)
## Primary splits:
## word splits as ------L----R-L------R-----R----R--LR--R-------------R----L--L-----L--------RL-RRR-----------R----------LR-----R------RR----R----, improve=1.1394020, (0 missing)
## word_count < 4.5 to the right, improve=0.8809020, (0 missing)
## sentiment splits as LRRL-R, improve=0.3883024, (0 missing)
## Surrogate splits:
## sentiment splits as LRRR-L, agree=0.713, adj=0.34, (0 split)
## word_count < 1.5 to the right, agree=0.639, adj=0.17, (0 split)
##
## Node number 25: 25 observations, complexity param=0.01477833
## predicted class=increase expected loss=0.44 P(node) =0.06024096
## class counts: 11 14
## probabilities: 0.440 0.560
## left son=50 (11 obs) right son=51 (14 obs)
## Primary splits:
## word splits as ------R-------------------L-------R-----------------L----R--------R-----------------------L-------------------------------------, improve=1.5148050, (0 missing)
## sentiment splits as R-LR-L, improve=1.3338890, (0 missing)
## word_count < 9.5 to the right, improve=0.4628571, (0 missing)
## Surrogate splits:
## sentiment splits as R-RR-L, agree=0.92, adj=0.818, (0 split)
## word_count < 19 to the right, agree=0.80, adj=0.545, (0 split)
##
## Node number 26: 13 observations
## predicted class=decrease expected loss=0.3846154 P(node) =0.0313253
## class counts: 8 5
## probabilities: 0.615 0.385
##
## Node number 27: 124 observations
## predicted class=increase expected loss=0.3629032 P(node) =0.2987952
## class counts: 45 79
## probabilities: 0.363 0.637
##
## Node number 48: 47 observations
## predicted class=decrease expected loss=0.3617021 P(node) =0.113253
## class counts: 30 17
## probabilities: 0.638 0.362
##
## Node number 49: 61 observations, complexity param=0.0180624
## predicted class=increase expected loss=0.4918033 P(node) =0.146988
## class counts: 30 31
## probabilities: 0.492 0.508
## left son=98 (15 obs) right son=99 (46 obs)
## Primary splits:
## word_count < 3.5 to the right, improve=2.320789000, (0 missing)
## word splits as -----------R--------L-----R----L---L--L-------------L----------------------L--LLL-----------L-----------L-----L------LL----L----, improve=0.037257820, (0 missing)
## sentiment splits as LLRL-L, improve=0.006088993, (0 missing)
## Surrogate splits:
## word splits as -----------R--------R-----R----R---R--R-------------L----------------------R--RRR-----------R-----------R-----R------RR----R----, agree=0.82, adj=0.267, (0 split)
##
## Node number 50: 11 observations
## predicted class=decrease expected loss=0.3636364 P(node) =0.02650602
## class counts: 7 4
## probabilities: 0.636 0.364
##
## Node number 51: 14 observations
## predicted class=increase expected loss=0.2857143 P(node) =0.03373494
## class counts: 4 10
## probabilities: 0.286 0.714
##
## Node number 98: 15 observations
## predicted class=decrease expected loss=0.2666667 P(node) =0.03614458
## class counts: 11 4
## probabilities: 0.733 0.267
##
## Node number 99: 46 observations, complexity param=0.01231527
## predicted class=increase expected loss=0.4130435 P(node) =0.1108434
## class counts: 19 27
## probabilities: 0.413 0.587
## left son=198 (29 obs) right son=199 (17 obs)
## Primary splits:
## word_count < 1.5 to the left, improve=1.7039420, (0 missing)
## word splits as -----------L--------L-----R----R---L--L------------------------------------L--RLL-----------R-----------L-----L------LL----R----, improve=1.2041150, (0 missing)
## sentiment splits as LRRL-L, improve=0.1785872, (0 missing)
## Surrogate splits:
## word splits as -----------R--------L-----L----R---L--L------------------------------------L--LLL-----------L-----------L-----L------LL----R----, agree=0.848, adj=0.588, (0 split)
## sentiment splits as LRLL-L, agree=0.739, adj=0.294, (0 split)
##
## Node number 198: 29 observations, complexity param=0.01231527
## predicted class=decrease expected loss=0.4827586 P(node) =0.06987952
## class counts: 15 14
## probabilities: 0.517 0.483
## left son=396 (9 obs) right son=397 (20 obs)
## Primary splits:
## word splits as --------------------L-----R--------R--R------------------------------------L--RRR-----------R-----------L-----R------RL---------, improve=1.7716480, (0 missing)
## sentiment splits as R-RL-L, improve=0.1788371, (0 missing)
## Surrogate splits:
## sentiment splits as R-RR-L, agree=0.793, adj=0.333, (0 split)
##
## Node number 199: 17 observations
## predicted class=increase expected loss=0.2352941 P(node) =0.04096386
## class counts: 4 13
## probabilities: 0.235 0.765
##
## Node number 396: 9 observations
## predicted class=decrease expected loss=0.2222222 P(node) =0.02168675
## class counts: 7 2
## probabilities: 0.778 0.222
##
## Node number 397: 20 observations
## predicted class=increase expected loss=0.4 P(node) =0.04819277
## class counts: 8 12
## probabilities: 0.400 0.600
predTree2 = predict(tree2,type="class")
ct = table(train$Movement,predTree2); ct
## predTree2
## decrease increase
## decrease 150 62
## increase 51 152
Accuracy rate of train data: 72.77%
predTree2_test = predict(tree2,newdata=testsubset,type="class")
ct = table(testsubset$Movement,predTree2_test); ct
## predTree2_test
## decrease increase
## decrease 20 23
## increase 25 22
Accuracy rate of test data: 46.67%
From the research conducted, annual financial reports can be indicative of the direction of stock price fluctuations through key words that delineate certain sentiments. The best-performing model was the decision tree with three predictor variables (sentiment, word, and word count), at a 73% accuracy rate on the train data but only a 47% accuracy rate on the test data; the gap between the two suggests the word-level predictors overfit the small sample. Limitations of the project include the small sample of companies, the volatility of external events, and the uneven distribution of words across the Loughran lexicon's sentiment categories.
Suggested next steps are to increase the sample size by including more companies and to extend the time span by incorporating prior years' annual reports.