Work Package 2: Textual Data

This document presents basic descriptive statistics for a corpus of strategic corporate communication (Work Package 2: Textual Data). I randomly sampled 33% of the documents (the share can be changed), which yields n = 1,398. Because the procedure is probabilistic, I set a seed so that the results can be replicated.
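The sampling step can be sketched in base R as follows. This is a minimal illustration only: the document identifiers, the corpus size of 1,000, and the seed value are hypothetical placeholders, not the values used for this report.

```r
# Hypothetical stand-in for the identifiers of all documents in the corpus.
doc_ids <- sprintf("doc_%04d", 1:1000)

sample_share <- 0.33  # share of documents to keep; can be changed
seed_value <- 2020    # arbitrary illustrative seed, not the one used in the report

set.seed(seed_value)
n_sample <- round(length(doc_ids) * sample_share)
sampled_ids <- sample(doc_ids, size = n_sample)  # sampling without replacement

# Resetting the same seed reproduces exactly the same draw.
set.seed(seed_value)
sampled_again <- sample(doc_ids, size = n_sample)
identical(sampled_ids, sampled_again)
```

Changing `sample_share` changes the sample size, while changing `seed_value` yields a different but equally reproducible draw.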

This table shows summaries for the Excel sheet

Descriptive statistics for words-of-interest

I stem words (reducing them to their root forms) using Martin Porter’s stemming algorithm, which is available through the quanteda package.
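A minimal sketch of the stemming step, assuming quanteda's `tokens_wordstem()` with the English Snowball stemmer (Porter's own successor to the original algorithm). The example sentence is made up, and the preprocessing options shown here are assumptions rather than the exact settings used for the corpus.

```r
library(quanteda)

# Hypothetical example sentence.
txt <- c(d1 = "Climate risks affect financial companies.")

toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Reduce each token to its stem, e.g. "climate" becomes "climat".
toks_stem <- tokens_wordstem(toks, language = "english")
as.list(toks_stem)
```

This is why the tables in this report show truncated forms such as `climat`, `financi`, and `compani` rather than full words.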

Frequencies grouped by year

##      feature frequency rank docfreq group
## 1       cibc       891    1       4  2000
## 2       year       802    2       6  2000
## 3      share       795    3       5  2000
## 4       oper       768    4       6  2000
## 5   interest       673    5       6  2000
## 6      incom       670    6       6  2000
## 7      manag       616    7       6  2000
## 8    financi       610    8       6  2000
## 9      asset       537    9       6  2000
## 10      rate       531   10       6  2000
## 11      busi       528   11       6  2000
## 12      risk       520   12       6  2000
## 13    market       506   13       6  2000
## 14    invest       499   14       6  2000
## 15     natur       489   15       5  2000
## 16      cost       478   16       6  2000
## 17   increas       453   17       6  2000
## 18    servic       453   17       5  2000
## 19      earn       449   19       5  2000
## 20     price       414   20       5  2000
## 21   product       411   21       6  2000
## 22    report       407   22       6  2000
## 23      valu       405   23       6  2000
## 24    includ       404   24       6  2000
## 25     capit       404   24       6  2000
## 26   enbridg       376   26       1  2000
## 27      loan       368   27       6  2000
## 28    credit       366   28       5  2000
## 29      bank       357   29       6  2000
## 30    revenu       346   30       6  2000
## 31    common       340   31       5  2000
## 32     relat       323   32       6  2000
## 33    custom       319   33       5  2000
## 34 statement       318   34       6  2000
## 35   pipelin       308   35       5  2000
## 36  consolid       305   36       6  2000
## 37      base       300   37       6  2000
## 38    provid       300   37       6  2000
## 39      cash       299   39       6  2000
## 40    corpor       295   40       6  2000
## 41      plan       280   41       6  2000
## 42   compani       278   42       6  2000
## 43    equiti       276   43       6  2000
## 44    result       276   43       6  2000
## 45     secur       273   45       6  2000
## 46    expens       264   46       5  2000
## 47      term       261   47       6  2000
## 48   account       260   48       6  2000
## 49 sharehold       257   49       6  2000
## 50    averag       257   49       5  2000

##        feature frequency rank docfreq group
## 12        risk       520   12       6  2000
## 4963      risk      1754   28      29  2001
## 15932     risk      4088   35      64  2002
## 39866     risk      3272    8      21  2003
## 47923     risk      4406   22      38  2004
## 61778     risk     10157   20      99  2005
## 104230    risk      2207   14      20  2006
## 112180    risk      2836   33      31  2007
## 124532    risk     24009   14     130  2008
## 180634    risk      3775    3      19  2009
## 192854    risk      5151   13      32  2010
## 206639    risk     35396   10     146  2011
## 271696    risk      2163   11      15  2012
## 282717    risk     10189    6      42  2013
## 302130    risk     49276    4     157  2014
## 364366    risk      3563    5      17  2015
## 372122    risk     10184   10      44  2016
## 388648    risk     59180    4     168  2017
## 466736    risk      8314    2      22  2018
## 481198    risk      5796   20      39  2019
## 494040    risk     59744    6     157  2020
## 559710    risk       124   32       1   316
## 1181    climat         8 1148       3  2000
## 6857    climat        18 1908       7  2001
## 17706   climat        67 1808      32  2002
## 41740   climat        17 1865       7  2003
## 49891   climat        39 1987      14  2004
## 63296   climat       170 1537      47  2005
## 106149  climat        16 1925       8  2006
## 113822  climat        46 1668      17  2007
## 125576  climat       615 1058      95  2008
## 181755  climat        65 1120      11  2009
## 194112  climat       100 1267      21  2010
## 207718  climat       745 1089     113  2011
## 273077  climat        31 1385       9  2012
## 284046  climat       133 1332      30  2013
## 303190  climat       910 1063     111  2014
## 365744  climat        37 1376       9  2015
## 373275  climat       209 1162      31  2016
## 389446  climat      1716  802     131  2017
## 467606  climat       145  870      18  2018
## 481935  climat       314  757      33  2019
## 494446  climat      4235  412     144  2020
## 561100  climat         2 1322       1   316
## 7485    carbon        10 2508       5  2001
## 18838   carbon        25 2921      13  2002
## 42240   carbon        10 2348       2  2003
## 50290   carbon        27 2379       9  2004
## 63418   carbon       148 1653      31  2005
## 106817  carbon         8 2562       5  2006
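Tables with the columns `feature`, `frequency`, `rank`, `docfreq`, and `group` can be produced with `textstat_frequency()`. The sketch below assumes quanteda version 3 or later, where that function lives in the quanteda.textstats package. The toy texts and the `year` document variable are hypothetical.

```r
library(quanteda)
library(quanteda.textstats)  # home of textstat_frequency() in quanteda >= 3

# Toy corpus; the texts and the `year` document variable are hypothetical.
corp <- corpus(
  c(a = "climate risk and market risk",
    b = "carbon price risk",
    c = "climate change and carbon",
    d = "interest rate risk"),
  docvars = data.frame(year = c(2000, 2000, 2001, 2001))
)

toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks, language = "english")
dfmat <- dfm(toks)

# One frequency ranking per year.
freq_by_year <- textstat_frequency(dfmat, groups = docvars(dfmat, "year"))

# Keep only the stemmed words of interest.
words_of_interest <- c("risk", "climat", "carbon")
freq_woi <- freq_by_year[freq_by_year$feature %in% words_of_interest, ]
freq_woi
```

Because the features are stemmed, the words of interest have to be given in their stemmed form (`climat`, not `climate`).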

This plot shows the frequencies of the words of interest, including “Risk”.

This plot shows the same data excluding “Risk” and “Energy” due to their large volume.

Next, I transform the corpus into a matrix. Two names are common for this structure:
* DFM, for document-feature matrix (the term used in the quanteda package)
* DTM, for document-term matrix (the term used in other packages such as tm)

When creating the matrix, I can choose whether each column holds a single word (unigrams), a two-word phrase (bigrams), or a three-word phrase (trigrams).

For demonstration, I chose unigrams and bigrams. This step also trims the matrix, with a minimum term frequency of 3 and a minimum document frequency of 3. This means that each term must occur at least three times in the corpus and appear in at least three documents.

This reduces the corpus to make it more manageable.
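The matrix construction and trimming can be sketched with quanteda as below. The toy texts are hypothetical, and the use of `tokens_ngrams()` and `dfm_trim()` is an assumption about how the step was implemented; the thresholds are the ones described above.

```r
library(quanteda)

# Toy texts (hypothetical), already lowercased for simplicity.
txt <- c(d1 = "climate change risk",
         d2 = "climate change policy",
         d3 = "climate change report",
         d4 = "oil price")

toks <- tokens(txt, remove_punct = TRUE)

# n = 1:2 keeps the unigrams and adds bigrams joined by "_".
toks_ng <- tokens_ngrams(toks, n = 1:2, concatenator = "_")
dfmat <- dfm(toks_ng)

# Keep features that occur at least 3 times overall
# and in at least 3 different documents.
dfmat_trim <- dfm_trim(dfmat, min_termfreq = 3, min_docfreq = 3)

dim(dfmat)             # documents x features before trimming
featnames(dfmat_trim)  # features that survive trimming
```

In this toy example only `climate`, `change`, and `climate_change` pass both thresholds; every other feature appears in a single document.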

TOPIC MODELLING

KeyATM

I created a dictionary for science skepticism (example).

Scien_Dict_KeyATM <- list(
  climate = c("climat", "green", "ghg", "environ", "pari", "polici", "chang", "interior", "natur", "emis"),
  science = c("scienc", "research", "certain", "pursuant", "fact", "innov", "technol"),
  energy = c("oil", "energi", "pipelin", "gas", "vehicl", "crude", "reserv"))
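For readers who want to see how such a dictionary enters the model, here is a minimal, self-contained sketch of a base keyATM fit. The corpus is synthetic, and the keyword list, the number of no-keyword topics, the seed, and the iteration count are illustrative assumptions, not the settings used for this report.

```r
library(quanteda)
library(keyATM)

# Synthetic corpus (hypothetical): 60 short documents, each drawing mostly
# from one of three small pools of stemmed words.
set.seed(1)
pools <- list(
  climate = c("climat", "emis", "green", "polici"),
  energy  = c("oil", "gas", "pipelin", "crude"),
  finance = c("loan", "bank", "credit", "share")
)
all_words <- unlist(pools, use.names = FALSE)
make_doc <- function(pool) {
  # Repeating the pool makes its words more likely to be drawn than the rest.
  paste(sample(c(rep(pool, 3), all_words), 50, replace = TRUE), collapse = " ")
}
texts <- vapply(rep(names(pools), each = 20),
                function(k) make_doc(pools[[k]]), character(1))
names(texts) <- paste0("doc", seq_along(texts))

dfmat <- dfm(tokens(texts))
keyATM_docs <- keyATM_read(texts = dfmat)

# Keywords for two topics; one extra topic is left without keywords.
keywords <- list(climate = c("climat", "emis"),
                 energy  = c("oil", "pipelin"))

out <- keyATM(
  docs = keyATM_docs,
  model = "base",
  keywords = keywords,
  no_keyword_topics = 1,
  options = list(seed = 250, iterations = 200)  # illustrative values
)

top_words(out, n = 5)  # highest-probability words per topic
top_docs(out, n = 5)   # indices of the most representative documents per topic
```

Keyword topics are labelled with their position and name (for example `1_climate`), and topics without keywords are labelled `Other_1`, `Other_2`, and so on, which is the naming seen in the output tables of this report.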

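The output shows a base KeyATM model with three additional topics that have no keywords. The first table lists the top words for each topic; [✓] marks a keyword of that topic, and a bracketed number marks a keyword that belongs to the topic with that number. The second table lists, by index, the documents most strongly associated with each topic.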

##    1_climate 2_science   3_energy   Other_1  Other_2 Other_3
## 1       risk      oper       oper     group    group    bank
## 2    financi      year energi [✓]      year    board    loan
## 3      asset   financi       cost    report    share  credit
## 4       valu    includ  natur [1]     share     year mortgag
## 5     invest     stock      price  director  financi   secur
## 6       loss     incom    product   financi    manag deposit
## 7      incom     share     includ  committe     oper  common
## 8      manag      cost       rate statement   report    cibc
## 9       rate      plan       year     board  compani    lend
## 10  interest      cash      power    execut director    card

##    1_climate 2_science 3_energy Other_1 Other_2 Other_3
## 1        935       871      419     833     309    1238
## 2        527       841       62     558    1043     975
## 3       1357        54     1312     905      67     798
## 4        149       773      123     626     187     663
## 5        146      1318      156     993     192     525
## 6       1338       474      661     706     911    1157
## 7        900      1379      631     640     170     165
## 8        700      1154      328     403     607     565
## 9       1385       994     1177     945    1190     812
## 10         8      1192      243     511     557    1011

This plot visualizes the ranking of the dictionary words in the corpus. The table below provides the corresponding summary statistics.

## # A tibble: 22 x 5
## # Groups:   Topic [3]
##    Word     WordCount `Proportion(%)` Ranking Topic    
##    <chr>        <int>           <dbl>   <int> <fct>    
##  1 chang       184698           0.388       1 1_climate
##  2 polici       90513           0.19        2 1_climate
##  3 natur        57551           0.121       3 1_climate
##  4 environ      29232           0.061       4 1_climate
##  5 pari         10350           0.022       5 1_climate
##  6 climat        9790           0.021       6 1_climate
##  7 green         4940           0.01        7 1_climate
##  8 interior      1365           0.003       8 1_climate
##  9 certain      83428           0.175       1 2_science
## 10 pursuant     23165           0.049       2 2_science
## # … with 12 more rows

Now I repeat the same exercise with bigrams (two-word phrases).

Bigram Frequencies

##               feature frequency rank docfreq group
## 1        common_share       247    1       5  2000
## 2   financi_statement       194    2       6  2000
## 3       interest_rate       149    3       6  2000
## 4        balanc_sheet       144    4       6  2000
## 5           incom_tax       132    5       5  2000
## 6    consolid_financi       130    6       6  2000
## 7        prefer_share       130    6       5  2000
## 8           long_term       128    8       6  2000
## 9           cash_flow       119    9       6  2000
## 10           year_end       114   10       6  2000
## 11     interest_incom       114   10       5  2000
## 12        vice_presid       105   12       6  2000
## 13         risk_manag       104   13       5  2000
## 14          fair_valu       101   14       5  2000
## 15      natur_resourc       100   15       4  2000
## 16        cibc_report        97   16       2  2000
## 17       world_market        95   17       2  2000
## 18         cibc_world        95   17       2  2000
## 19    interest_expens        90   19       5  2000
## 20         unit_state        89   20       6  2000
## 21         power_hold        85   21       1  2000
## 22     enbridg_consum        80   22       1  2000
## 23    foreign_exchang        79   23       5  2000
## 24       asset_liabil        74   24       5  2000
## 25        nova_scotia        74   24       3  2000
## 26          term_debt        71   26       4  2000
## 27     board_director        66   27       5  2000
## 28   sharehold_equiti        65   28       6  2000
## 29       chief_execut        62   29       5  2000
## 30 consolid_statement        62   29       6  2000
## 31      note_consolid        60   31       6  2000
## 32        credit_loss        60   31       1  2000
## 33         cubic_feet        60   31       4  2000
## 34 financi_instrument        56   34       5  2000
## 35        execut_vice        56   34       4  2000
## 36     liquid_pipelin        55   36       2  2000
## 37        busi_govern        54   37       1  2000
## 38       presid_chief        53   38       5  2000
## 39        credit_risk        52   39       4  2000
## 40         great_lake        52   39       2  2000
## 41       execut_offic        51   41       4  2000
## 42     minor_interest        50   42       4  2000
## 43        market_risk        50   42       1  2000
## 44       wealth_manag        50   42       2  2000
## 45          seri_seri        49   45       2  2000
## 46      energi_servic        47   46       2  2000
## 47        credit_card        46   47       1  2000
## 48          gain_loss        46   47       5  2000
## 49        impair_loan        46   47       1  2000
## 50          year_year        45   50       4  2000

##                  feature frequency   rank docfreq group
## 1390        climat_chang         5   1366       2  2000
## 39519       climat_chang        13   2756       4  2001
## 177482      climat_chang        14   8832       7  2002
## 463212      climat_chang        10   4842       3  2003
## 597203      climat_chang        20   4634       6  2004
## 837171      climat_chang        47   3900      17  2005
## 1296977     climat_chang         7   6442       4  2006
## 1433380     climat_chang        19   4183      10  2007
## 1643329     climat_chang       192   1434      52  2008
## 2303674     climat_chang        27   1277       9  2009
## 2471386     climat_chang        62    938      16  2010
## 2727477     climat_chang       307   1033      76  2011
## 3478831     climat_chang        16   2110       7  2012
## 3606789     climat_chang        86   1188      25  2013
## 3942109     climat_chang       383    859      82  2014
## 4743792     climat_chang        24   1645       9  2015
## 4896409     climat_chang       149    648      26  2016
## 5249388     climat_chang       665    457      97  2017
## 6117757     climat_chang        81    583      18  2018
## 6319260     climat_chang       217    221      31  2019
## 6617510     climat_chang      1798    111     136  2020
## 7460352     climat_chang         2   1933       1   316
## 62438          crude_oil         3  17184       2  2001
## 239681         crude_oil         3  50595       2  2002
## 513560         crude_oil         2  31116       1  2003
## 695190         crude_oil         2  62491       1  2004
## 1038729        crude_oil         2 133485       2  2005
## 1418504        crude_oil         1  51583       1  2006
## 1483228        crude_oil         3  38070       3  2007
## 1984165        crude_oil         2 241944       2  2008
## 2535085        crude_oil         3  45547       2  2010
## 3610221        crude_oil        34   4451       3  2013
## 4255086        crude_oil         3 245683       1  2014
## 4800647        crude_oil         2  33175       1  2015
## 4899045        crude_oil        50   3196       3  2016
## 5434603        crude_oil         6 160566       3  2017
## 6303730        crude_oil         1  92160       1  2018
## 6328704        crude_oil        17   8965       2  2019
## 6839001        crude_oil         5 187926       2  2020
## 39828   research_develop        12   3078       8  2001
## 168958  research_develop       130    334      24  2002
## 459927  research_develop        23   1587       5  2003
## 594132  research_develop        44   1576      12  2004
## 833459  research_develop       338    195      52  2005
## 1291423 research_develop        28    927       8  2006
## 1430163 research_develop        52    977      12  2007
## 1642201 research_develop       530    306      67  2008
## 2303416 research_develop        31   1023       8  2009
## 2471411 research_develop        61    963      16  2010
## 2726726 research_develop       721    283      79  2011

This plot shows the frequencies of the bigrams of interest.

Base KeyATM for bigrams

I created a dictionary for science skepticism (example).

Bigram_Dict_KeyATM <- list(
  climate = c("climat_chang", "global_climat" , "life_health"),
  science = c("research_develop", "exact_scienc"),
  energy = c("crude_oil", "pipelin_system", "feeder_pipelin", "natur_gas"))
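Bigram keywords only have an effect if they match feature names in the bigram matrix exactly, including stemming and the `_` separator. The following sketch checks which entries of a dictionary are present; the two example texts and the shortened dictionary are hypothetical.

```r
library(quanteda)

# Hypothetical example texts.
txt <- c(d1 = "Climate change and crude oil",
         d2 = "Research and development on climate change")

toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks, language = "english")

# Bigrams only, joined by the default "_" separator.
dfmat_bi <- dfm(tokens_ngrams(toks, n = 2))

# Shortened illustrative dictionary in the same format as above.
dict <- list(climate = c("climat_chang", "global_climat"),
             science = c("research_develop"),
             energy  = c("crude_oil"))

# For each topic, keep the keywords that occur as features.
found <- lapply(dict, function(k) k[k %in% featnames(dfmat_bi)])
found
```

Here `global_climat` is dropped because it never occurs in the toy texts. Note that stopwords are removed before the bigrams are formed, so "research and development" becomes `research_develop`.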

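The output shows a base KeyATM model with three additional topics that have no keywords. The first table lists the top bigrams for each topic; the second lists, by index, the documents most strongly associated with each topic.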

##            1_climate         2_science          3_energy           Other_1
## 1          long_term financi_statement         long_term    audit_committe
## 2  financi_statement         fair_valu         cash_flow      execut_offic
## 3          cash_flow    board_director      common_share    board_director
## 4           year_end   incom_statement       vice_presid      chief_execut
## 5     account_polici     financi_asset  consolid_financi      prefer_share
## 6      defin_benefit   execut_director  capit_expenditur    extern_auditor
## 7    foreign_exchang  consolid_financi         oper_cost     servic_provid
## 8         prior_year supervisori_board financi_statement financi_statement
## 9       exchang_rate      balanc_sheet       plant_equip    financi_report
## 10   foreign_currenc       profit_loss    properti_plant    intern_control
##             Other_2           Other_3
## 1         fair_valu          year_end
## 2     interest_rate         fair_valu
## 3      balanc_sheet financi_statement
## 4       credit_risk  consolid_financi
## 5    interest_incom         cash_flow
## 6        real_estat      common_stock
## 7        risk_manag       result_oper
## 8         gain_loss     note_consolid
## 9      asset_liabil    intern_control
## 10 consolid_financi         incom_tax

##    1_climate 2_science 3_energy Other_1 Other_2 Other_3
## 1        832      1189      418    1237     561     870
## 2        140       170     1311     974     437     840
## 3        365       335      243     492     546    1378
## 4        801       658     1176     672     618    1277
## 5       1336       911       62     488     195    1305
## 6         90       909      660     494      68    1125
## 7        504       406      156     215     279     253
## 8        445      1215      327     629     717     615
## 9        621       595      885     797     741     751
## 10       193       805     1129     393    1017     501

Nikita Sleptcov

04/10/2020