This document presents some basic statistics for a corpus of strategic corporate communication (Work Package 2: Textual Data). I randomly sampled 33% of the documents (this share can be changed), which gives n = 1398. Because the sampling is random, I set a seed so that the results are replicable.
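As a minimal sketch of the seeded sampling step: the document IDs, the corpus size of 3000 and the seed value below are all placeholders (the real corpus size and seed are not stated here), so only the pattern carries over.

```r
# Hypothetical document IDs standing in for the full corpus
# (the corpus size of 3000 is a placeholder, not the real figure)
doc_ids <- sprintf("doc_%04d", 1:3000)

sample_share <- 0.33  # share of documents to keep; adjustable
set.seed(2021)        # arbitrary seed; any fixed integer gives a repeatable draw

# Draw the sample without replacement
n_sample <- round(length(doc_ids) * sample_share)
sampled_ids <- sample(doc_ids, size = n_sample)

length(sampled_ids)
```

Re-running the script with the same seed returns the same documents, which is what makes the later statistics replicable.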
This table shows summaries for the Excel sheet
Descriptive statistics for words-of-interest
I stem words (shortening words to their root forms) using Martin Porter's stemming algorithm (included in the quanteda package).
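A small sketch of what stemming does to the text, using quanteda's `tokens_wordstem()`. The example sentence and the preprocessing choices (lowercasing, removing punctuation and English stopwords) are assumptions for illustration and may differ from the pipeline actually used.

```r
library(quanteda)

# Toy sentence; not taken from the corpus
txt <- c(d1 = "Climate risks changed the financial rates.")

toks <- tokens(txt, remove_punct = TRUE)      # split into word tokens
toks <- tokens_tolower(toks)                  # lowercase everything
toks <- tokens_remove(toks, stopwords("en"))  # drop English stopwords such as "the"

# Snowball (Porter) stemmer for English
stems <- tokens_wordstem(toks, language = "english")
as.character(stems)
```

This is why the tables below list forms such as `climat`, `financi` and `risk` rather than full words.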
Frequencies grouped by year
## feature frequency rank docfreq group
## 1 cibc 891 1 4 2000
## 2 year 802 2 6 2000
## 3 share 795 3 5 2000
## 4 oper 768 4 6 2000
## 5 interest 673 5 6 2000
## 6 incom 670 6 6 2000
## 7 manag 616 7 6 2000
## 8 financi 610 8 6 2000
## 9 asset 537 9 6 2000
## 10 rate 531 10 6 2000
## 11 busi 528 11 6 2000
## 12 risk 520 12 6 2000
## 13 market 506 13 6 2000
## 14 invest 499 14 6 2000
## 15 natur 489 15 5 2000
## 16 cost 478 16 6 2000
## 17 increas 453 17 6 2000
## 18 servic 453 17 5 2000
## 19 earn 449 19 5 2000
## 20 price 414 20 5 2000
## 21 product 411 21 6 2000
## 22 report 407 22 6 2000
## 23 valu 405 23 6 2000
## 24 includ 404 24 6 2000
## 25 capit 404 24 6 2000
## 26 enbridg 376 26 1 2000
## 27 loan 368 27 6 2000
## 28 credit 366 28 5 2000
## 29 bank 357 29 6 2000
## 30 revenu 346 30 6 2000
## 31 common 340 31 5 2000
## 32 relat 323 32 6 2000
## 33 custom 319 33 5 2000
## 34 statement 318 34 6 2000
## 35 pipelin 308 35 5 2000
## 36 consolid 305 36 6 2000
## 37 base 300 37 6 2000
## 38 provid 300 37 6 2000
## 39 cash 299 39 6 2000
## 40 corpor 295 40 6 2000
## 41 plan 280 41 6 2000
## 42 compani 278 42 6 2000
## 43 equiti 276 43 6 2000
## 44 result 276 43 6 2000
## 45 secur 273 45 6 2000
## 46 expens 264 46 5 2000
## 47 term 261 47 6 2000
## 48 account 260 48 6 2000
## 49 sharehold 257 49 6 2000
## 50 averag 257 49 5 2000
## feature frequency rank docfreq group
## 12 risk 520 12 6 2000
## 4963 risk 1754 28 29 2001
## 15932 risk 4088 35 64 2002
## 39866 risk 3272 8 21 2003
## 47923 risk 4406 22 38 2004
## 61778 risk 10157 20 99 2005
## 104230 risk 2207 14 20 2006
## 112180 risk 2836 33 31 2007
## 124532 risk 24009 14 130 2008
## 180634 risk 3775 3 19 2009
## 192854 risk 5151 13 32 2010
## 206639 risk 35396 10 146 2011
## 271696 risk 2163 11 15 2012
## 282717 risk 10189 6 42 2013
## 302130 risk 49276 4 157 2014
## 364366 risk 3563 5 17 2015
## 372122 risk 10184 10 44 2016
## 388648 risk 59180 4 168 2017
## 466736 risk 8314 2 22 2018
## 481198 risk 5796 20 39 2019
## 494040 risk 59744 6 157 2020
## 559710 risk 124 32 1 316
## 1181 climat 8 1148 3 2000
## 6857 climat 18 1908 7 2001
## 17706 climat 67 1808 32 2002
## 41740 climat 17 1865 7 2003
## 49891 climat 39 1987 14 2004
## 63296 climat 170 1537 47 2005
## 106149 climat 16 1925 8 2006
## 113822 climat 46 1668 17 2007
## 125576 climat 615 1058 95 2008
## 181755 climat 65 1120 11 2009
## 194112 climat 100 1267 21 2010
## 207718 climat 745 1089 113 2011
## 273077 climat 31 1385 9 2012
## 284046 climat 133 1332 30 2013
## 303190 climat 910 1063 111 2014
## 365744 climat 37 1376 9 2015
## 373275 climat 209 1162 31 2016
## 389446 climat 1716 802 131 2017
## 467606 climat 145 870 18 2018
## 481935 climat 314 757 33 2019
## 494446 climat 4235 412 144 2020
## 561100 climat 2 1322 1 316
## 7485 carbon 10 2508 5 2001
## 18838 carbon 25 2921 13 2002
## 42240 carbon 10 2348 2 2003
## 50290 carbon 27 2379 9 2004
## 63418 carbon 148 1653 31 2005
## 106817 carbon 8 2562 5 2006
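The tables above have the column layout returned by `textstat_frequency()` from the quanteda.textstats package (feature, frequency, rank, docfreq, group). The sketch below shows how such a table could be built on a toy corpus; the texts and the `year` document variable are made up for illustration.

```r
library(quanteda)
library(quanteda.textstats)

# Toy corpus with a hypothetical 'year' document variable
txt <- c(a = "risk risk climate", b = "risk carbon", c = "climate climate risk")
corp <- corpus(txt, docvars = data.frame(year = c(2000, 2000, 2001)))

dfmat <- dfm(tokens(corp))

# Frequency, rank and document frequency of each feature within each year
freq_by_year <- textstat_frequency(dfmat, groups = year)

# Keep only the words of interest
subset(freq_by_year, feature %in% c("risk", "climate"))
```

Within each group, `rank` orders features by frequency and `docfreq` counts the documents that contain the feature.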
This plot shows the frequencies of the words of interest, including "Risk".
This plot shows the same data excluding "Risk" and "Energy" due to their large volume.
Next, I transform the corpus into a matrix. Two names are in common use for this object:
* DFM, for document-feature matrix (the term used in the quanteda package)
* DTM, for document-term matrix (the term used in other packages such as tm)

When creating the matrix, I can choose to work with one word per column (unigrams), two-word phrases (bigrams), or three-word phrases (trigrams).
For demonstration, I chose unigrams and bigrams. This step also trims the matrix to a minimum term frequency of 3 and a minimum document frequency of 3, so each term must occur at least three times and appear in at least three documents.
This reduces the corpus to make it more manageable.
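The n-gram and trimming steps described above can be sketched with quanteda as follows. The four toy documents are invented, and the thresholds simply mirror the values mentioned in the text; in `dfm_trim()`, `min_termfreq` counts occurrences across all documents.

```r
library(quanteda)

# Toy tokenised documents; not taken from the corpus
toks <- tokens(c(d1 = "climate change risk",
                 d2 = "climate change policy",
                 d3 = "climate change risk",
                 d4 = "interest rate risk"))

# Unigrams and bigrams together; bigrams are joined with "_"
toks_ng <- tokens_ngrams(toks, n = 1:2)
dfmat <- dfm(toks_ng)

# Keep features that occur at least 3 times and in at least 3 documents
dfmat_trimmed <- dfm_trim(dfmat, min_termfreq = 3, min_docfreq = 3)

featnames(dfmat_trimmed)
```

Rare features such as `policy` or `interest_rate` are dropped here, which is the reduction in size that the trimming is meant to achieve.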
TOPIC MODELLING
KeyATM
I created a dictionary for science skepticism (example).
Scien_Dict_KeyATM <- list(
climate = c("climat", "green", "ghg", "environ", "pari", "polici", "chang", "interior", "natur", "emis"),
science = c("scienc", "research", "certain", "pursuant", "fact", "innov", "technol"),
energy = c("oil", "energi", "pipelin", "gas", "vehicl", "crude", "reserv"))
The output shows a base keyATM model with the three keyword topics and three extra topics without keywords. It also lists, for each topic, the documents in which that topic is most prominent.
##    1_climate 2_science 3_energy   Other_1   Other_2  Other_3
## 1  risk      oper      oper       group     group    bank
## 2  financi   year      energi [✓] year      board    loan
## 3  asset     financi   cost       report    share    credit
## 4  valu      includ    natur [1]  share     year     mortgag
## 5  invest    stock     price      director  financi  secur
## 6  loss      incom     product    financi   manag    deposit
## 7  incom     share     includ     committe  oper     common
## 8  manag     cost      rate       statement report   cibc
## 9  rate      plan      year       board     compani  lend
## 10 interest  cash      power      execut    director card
## 1_climate 2_science 3_energy Other_1 Other_2 Other_3
## 1 935 871 419 833 309 1238
## 2 527 841 62 558 1043 975
## 3 1357 54 1312 905 67 798
## 4 149 773 123 626 187 663
## 5 146 1318 156 993 192 525
## 6 1338 474 661 706 911 1157
## 7 900 1379 631 640 170 165
## 8 700 1154 328 403 607 565
## 9 1385 994 1177 945 1190 812
## 10 8 1192 243 511 557 1011
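Two tables of this shape (top words per topic, then top documents per topic) can be obtained from a fitted keyATM model with `top_words()` and `top_docs()`. The sketch below fits a base model on a randomly generated toy corpus; the vocabulary, the shortened keyword lists, the seed and the reduced iteration count are assumptions chosen so that the example runs quickly, not the settings used for the results above.

```r
library(quanteda)
library(keyATM)

# Toy corpus of random "stemmed" documents (illustrative only)
set.seed(1)
vocab <- c("climat", "emis", "green", "scienc", "research", "innov",
           "oil", "gas", "pipelin", "bank", "loan", "share", "board", "year")
txt <- replicate(30, paste(sample(vocab, 40, replace = TRUE), collapse = " "))
names(txt) <- paste0("doc", seq_along(txt))
dfmat <- dfm(tokens(txt))

# Shortened keyword dictionary in the same format as Scien_Dict_KeyATM
keywords <- list(
  climate = c("climat", "green", "emis"),
  science = c("scienc", "research", "innov"),
  energy  = c("oil", "gas", "pipelin"))

# Convert the dfm to keyATM's input format and fit a base model
# with 3 additional topics that have no keywords
docs <- keyATM_read(texts = dfmat)
out <- keyATM(docs = docs,
              no_keyword_topics = 3,
              keywords = keywords,
              model = "base",
              options = list(seed = 250, iterations = 200))

top_words(out, n = 5)  # most probable words per topic
top_docs(out, n = 5)   # indices of documents with the highest share of each topic
```

In the word table, a tick next to a word marks a keyword assigned to that topic, and a number in brackets marks a keyword that belongs to another topic.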
This plot visualizes ranking of the dictionary words in the corpus. It also provides summary statistics.
## # A tibble: 22 x 5
## # Groups: Topic [3]
## Word WordCount `Proportion(%)` Ranking Topic
## <chr> <int> <dbl> <int> <fct>
## 1 chang 184698 0.388 1 1_climate
## 2 polici 90513 0.19 2 1_climate
## 3 natur 57551 0.121 3 1_climate
## 4 environ 29232 0.061 4 1_climate
## 5 pari 10350 0.022 5 1_climate
## 6 climat 9790 0.021 6 1_climate
## 7 green 4940 0.01 7 1_climate
## 8 interior 1365 0.003 8 1_climate
## 9 certain 83428 0.175 1 2_science
## 10 pursuant 23165 0.049 2 2_science
## # … with 12 more rows
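To make the columns of this summary concrete, the base R sketch below recomputes them on a toy token vector. It assumes that `Proportion(%)` is the keyword's count as a percentage of all tokens in the corpus and that `Ranking` orders keywords by count within each topic; the tokens and keyword lists are invented, and this is not keyATM's own code.

```r
# Toy token vector standing in for the whole corpus
tokens_all <- c("chang", "chang", "chang", "polici", "polici", "climat",
                "certain", "certain", "scienc", "bank", "loan", "year")

# Shortened keyword dictionary (illustrative only)
keywords <- list(climate = c("chang", "polici", "climat"),
                 science = c("certain", "scienc"))

counts <- table(tokens_all)

# One block of rows per topic: count, share of all tokens, rank within topic
rows <- lapply(seq_along(keywords), function(i) {
  words <- keywords[[i]]
  wc <- as.integer(counts[words])
  data.frame(Word = words,
             WordCount = wc,
             Proportion = round(100 * wc / length(tokens_all), 3),
             Ranking = rank(-wc, ties.method = "first"),
             Topic = paste0(i, "_", names(keywords)[i]))
})
keyword_stats <- do.call(rbind, rows)
keyword_stats
```

A keyword with a very small proportion, such as `interior` in the table above, contributes little information to its topic.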
Now, the same exercise but with bigrams (phrases)
Bigram Frequencies
## feature frequency rank docfreq group
## 1 common_share 247 1 5 2000
## 2 financi_statement 194 2 6 2000
## 3 interest_rate 149 3 6 2000
## 4 balanc_sheet 144 4 6 2000
## 5 incom_tax 132 5 5 2000
## 6 consolid_financi 130 6 6 2000
## 7 prefer_share 130 6 5 2000
## 8 long_term 128 8 6 2000
## 9 cash_flow 119 9 6 2000
## 10 year_end 114 10 6 2000
## 11 interest_incom 114 10 5 2000
## 12 vice_presid 105 12 6 2000
## 13 risk_manag 104 13 5 2000
## 14 fair_valu 101 14 5 2000
## 15 natur_resourc 100 15 4 2000
## 16 cibc_report 97 16 2 2000
## 17 world_market 95 17 2 2000
## 18 cibc_world 95 17 2 2000
## 19 interest_expens 90 19 5 2000
## 20 unit_state 89 20 6 2000
## 21 power_hold 85 21 1 2000
## 22 enbridg_consum 80 22 1 2000
## 23 foreign_exchang 79 23 5 2000
## 24 asset_liabil 74 24 5 2000
## 25 nova_scotia 74 24 3 2000
## 26 term_debt 71 26 4 2000
## 27 board_director 66 27 5 2000
## 28 sharehold_equiti 65 28 6 2000
## 29 chief_execut 62 29 5 2000
## 30 consolid_statement 62 29 6 2000
## 31 note_consolid 60 31 6 2000
## 32 credit_loss 60 31 1 2000
## 33 cubic_feet 60 31 4 2000
## 34 financi_instrument 56 34 5 2000
## 35 execut_vice 56 34 4 2000
## 36 liquid_pipelin 55 36 2 2000
## 37 busi_govern 54 37 1 2000
## 38 presid_chief 53 38 5 2000
## 39 credit_risk 52 39 4 2000
## 40 great_lake 52 39 2 2000
## 41 execut_offic 51 41 4 2000
## 42 minor_interest 50 42 4 2000
## 43 market_risk 50 42 1 2000
## 44 wealth_manag 50 42 2 2000
## 45 seri_seri 49 45 2 2000
## 46 energi_servic 47 46 2 2000
## 47 credit_card 46 47 1 2000
## 48 gain_loss 46 47 5 2000
## 49 impair_loan 46 47 1 2000
## 50 year_year 45 50 4 2000
## feature frequency rank docfreq group
## 1390 climat_chang 5 1366 2 2000
## 39519 climat_chang 13 2756 4 2001
## 177482 climat_chang 14 8832 7 2002
## 463212 climat_chang 10 4842 3 2003
## 597203 climat_chang 20 4634 6 2004
## 837171 climat_chang 47 3900 17 2005
## 1296977 climat_chang 7 6442 4 2006
## 1433380 climat_chang 19 4183 10 2007
## 1643329 climat_chang 192 1434 52 2008
## 2303674 climat_chang 27 1277 9 2009
## 2471386 climat_chang 62 938 16 2010
## 2727477 climat_chang 307 1033 76 2011
## 3478831 climat_chang 16 2110 7 2012
## 3606789 climat_chang 86 1188 25 2013
## 3942109 climat_chang 383 859 82 2014
## 4743792 climat_chang 24 1645 9 2015
## 4896409 climat_chang 149 648 26 2016
## 5249388 climat_chang 665 457 97 2017
## 6117757 climat_chang 81 583 18 2018
## 6319260 climat_chang 217 221 31 2019
## 6617510 climat_chang 1798 111 136 2020
## 7460352 climat_chang 2 1933 1 316
## 62438 crude_oil 3 17184 2 2001
## 239681 crude_oil 3 50595 2 2002
## 513560 crude_oil 2 31116 1 2003
## 695190 crude_oil 2 62491 1 2004
## 1038729 crude_oil 2 133485 2 2005
## 1418504 crude_oil 1 51583 1 2006
## 1483228 crude_oil 3 38070 3 2007
## 1984165 crude_oil 2 241944 2 2008
## 2535085 crude_oil 3 45547 2 2010
## 3610221 crude_oil 34 4451 3 2013
## 4255086 crude_oil 3 245683 1 2014
## 4800647 crude_oil 2 33175 1 2015
## 4899045 crude_oil 50 3196 3 2016
## 5434603 crude_oil 6 160566 3 2017
## 6303730 crude_oil 1 92160 1 2018
## 6328704 crude_oil 17 8965 2 2019
## 6839001 crude_oil 5 187926 2 2020
## 39828 research_develop 12 3078 8 2001
## 168958 research_develop 130 334 24 2002
## 459927 research_develop 23 1587 5 2003
## 594132 research_develop 44 1576 12 2004
## 833459 research_develop 338 195 52 2005
## 1291423 research_develop 28 927 8 2006
## 1430163 research_develop 52 977 12 2007
## 1642201 research_develop 530 306 67 2008
## 2303416 research_develop 31 1023 8 2009
## 2471411 research_develop 61 963 16 2010
## 2726726 research_develop 721 283 79 2011
This plot demonstrates frequencies of the bigrams-of-interest
Base KeyATM for bigrams
I created a dictionary for science skepticism (example).
Bigram_Dict_KeyATM <- list(
climate = c("climat_chang", "global_climat" , "life_health"),
science = c("research_develop", "exact_scienc"),
energy = c("crude_oil", "pipelin_system", "feeder_pipelin", "natur_gas"))
The output shows a base keyATM model with the three keyword topics and three extra topics without keywords. It also lists, for each topic, the documents in which that topic is most prominent.
## 1_climate 2_science 3_energy Other_1
## 1 long_term financi_statement long_term audit_committe
## 2 financi_statement fair_valu cash_flow execut_offic
## 3 cash_flow board_director common_share board_director
## 4 year_end incom_statement vice_presid chief_execut
## 5 account_polici financi_asset consolid_financi prefer_share
## 6 defin_benefit execut_director capit_expenditur extern_auditor
## 7 foreign_exchang consolid_financi oper_cost servic_provid
## 8 prior_year supervisori_board financi_statement financi_statement
## 9 exchang_rate balanc_sheet plant_equip financi_report
## 10 foreign_currenc profit_loss properti_plant intern_control
## Other_2 Other_3
## 1 fair_valu year_end
## 2 interest_rate fair_valu
## 3 balanc_sheet financi_statement
## 4 credit_risk consolid_financi
## 5 interest_incom cash_flow
## 6 real_estat common_stock
## 7 risk_manag result_oper
## 8 gain_loss note_consolid
## 9 asset_liabil intern_control
## 10 consolid_financi incom_tax
## 1_climate 2_science 3_energy Other_1 Other_2 Other_3
## 1 832 1189 418 1237 561 870
## 2 140 170 1311 974 437 840
## 3 365 335 243 492 546 1378
## 4 801 658 1176 672 618 1277
## 5 1336 911 62 488 195 1305
## 6 90 909 660 494 68 1125
## 7 504 406 156 215 279 253
## 8 445 1215 327 629 717 615
## 9 621 595 885 797 741 751
## 10 193 805 1129 393 1017 501