Fiscal talk by the Fed: First cut
KnowLegPo team
1. Creating a dictionary of fiscal-related expressions
We do not want to miss keywords, especially since the vocabulary used to discuss fiscal issues may have changed over the fifty-year period. Therefore, to supplement the dictionary of keywords presented below, we test an automated or semi-automated process of keyword selection, starting from three “seed” words that are undeniably relevant: fiscal, deficit and budget.
1.1 Embeddings and cosine similarity
The first step combines word embeddings with cosine similarity. First, we pre-process the whole corpus: we remove numbers and stop words, convert the text to lowercase, and group the sentences back at the intervention/speech level. Then, we create word vectors using GloVe (Global Vectors for Word Representation), with a context window of 5 words to capture co-occurrence and 100-dimensional vectors representing word meanings. We then explore the 100 words closest to fiscal, the 100 closest to deficit and the 100 closest to budget.
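The nearest-neighbour lookup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the pipeline actually used: the function names `cosine` and `nearest` are hypothetical, and the 3-dimensional toy vectors are made up to stand in for the 100-dimensional GloVe vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(seed, vectors, k=100):
    """Return the k words closest to `seed` as (word, similarity) pairs."""
    target = vectors[seed]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != seed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only (not taken from the corpus).
toy_vectors = {
    "fiscal": [0.9, 0.1, 0.0],
    "budget": [0.8, 0.2, 0.1],
    "deficit": [0.7, 0.3, 0.0],
    "weather": [0.0, 0.1, 0.9],
}

top = nearest("fiscal", toy_vectors, k=2)
```

With real embeddings, the same call with `k=100` would be repeated once per seed word to obtain the three neighbour lists.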
These top-100 lists only give us the best single-word (1-gram) matches for each of our three seed words. Each seed word always appears among the 5 closest words of the other seed words, which is a good indication that they are indeed very closely related. Some words appear in several lists, which suggests they are also highly relevant. In total, 203 unique words appear in the top 100 of one, two or three of our seed words. These 203 unique words are listed below, sorted by summed proximity to our three seed words.
Combined Terms Analysis
Some of these 203 words, like sequestration, taxation or austerity, appear likely to be directly relevant to our research question. We could add them to our dictionary of relevant keywords directly. However, most words, despite having some proximity to our seed concepts, appear too broad. For example, “surplus” may often be used to talk about fiscal issues, but not always.
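The merging of the three neighbour lists into one ranking of unique words, sorted by summed proximity, can be sketched as below. The function name `combine_neighbours` and all similarity values are assumptions made for illustration; they are not results from the corpus.

```python
from collections import defaultdict

def combine_neighbours(neighbours_by_seed):
    """Merge per-seed neighbour lists into one ranking by summed similarity.

    Returns (word, summed_similarity, number_of_seed_lists) tuples.
    """
    summed = defaultdict(float)
    seeds_hit = defaultdict(set)
    for seed, neighbours in neighbours_by_seed.items():
        for word, sim in neighbours:
            summed[word] += sim
            seeds_hit[word].add(seed)
    ranked = sorted(summed, key=lambda w: summed[w], reverse=True)
    return [(w, summed[w], len(seeds_hit[w])) for w in ranked]

# Made-up similarities, three neighbours per seed instead of one hundred.
toy = {
    "fiscal": [("budget", 0.80), ("taxation", 0.60), ("monetary", 0.55)],
    "budget": [("fiscal", 0.80), ("deficit", 0.75), ("taxation", 0.50)],
    "deficit": [("budget", 0.75), ("fiscal", 0.70), ("surplus", 0.65)],
}

combined = combine_neighbours(toy)
```

A word that appears in several lists accumulates a higher score, which is why words shared across seeds rise to the top of the combined ranking.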
1.2 Inspecting bigrams
The obvious solution in the literature is to move away from single keywords to consider key expressions, or n-grams (expressions composed of a number n of words). Rather than using our 203 words as is, we should create expressions containing these words to make sure they are indeed used in a relevant context. For example, we may want to use “government spending” (rather than “government” or “spending”), or “federal expenditures” (rather than “federal” or “expenditures”).
Of course, there is a balance to strike. Keeping all words single would mean capturing all relevant discussion, but also (a lot of) false positives. Using only bigrams instead may lead to fewer (or even no) false positives, but risks missing relevant discussions, especially since some words like “budget” or “deficit” may often be used on their own to refer to the “federal budget” or the “fiscal deficit”.
To make an informed decision about this trade-off with insight from our data, we follow four steps. First, we use the quanteda package to identify all the bigrams in our corpus, after removing stop words such as “the”, “have” and other very frequent words that do not carry meaning. Second, we zoom in on all the bigrams in which our 203 candidate keywords appear. Third, we count the number of times these potentially relevant bigrams appear in our corpus to order them by relevance. Last, we manually check the top 10 bigrams per candidate keyword, excluding those that include “fiscal”, “budget” or “deficit”, since those will already be captured. This process allows us to screen 2030 likely relevant expressions, which are listed in the table below:
Terms
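The bigram screening steps described in this section can be sketched in Python as follows. This is not the quanteda code used for the analysis: the stop-word list is a tiny stand-in, the function names are hypothetical, and dropping seed-containing bigrams before ranking (rather than after) is an assumption.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"the", "have", "of", "a", "to", "is", "and", "in"}
SEEDS = {"fiscal", "budget", "deficit"}

def bigrams(text):
    """Lowercase, keep alphabetic tokens, drop stop words, pair neighbours."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    return list(zip(tokens, tokens[1:]))

def top_bigrams_per_keyword(documents, candidates, k=10):
    """Top-k most frequent bigrams containing each candidate keyword."""
    counts = Counter(bg for doc in documents for bg in bigrams(doc))
    per_keyword = defaultdict(list)
    for (w1, w2), n in counts.most_common():
        # Assumption: bigrams containing a seed word are dropped before ranking.
        if w1 in SEEDS or w2 in SEEDS:
            continue
        for kw in {w1, w2} & candidates:
            if len(per_keyword[kw]) < k:
                per_keyword[kw].append((f"{w1} {w2}", n))
    return dict(per_keyword)

# Invented example sentences, not taken from the corpus.
docs = [
    "The government spending rose.",
    "Government spending and the federal spending fell.",
    "Fiscal spending is high.",
]
result = top_bigrams_per_keyword(docs, {"government", "spending"}, k=2)
```

Because stop words are removed before pairing, two words separated only by a stop word end up in the same bigram; the resulting lists still need the manual check described above.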
1.3 Creating a first dictionary
We complement this manual inspection of the 2030 terms above with checks of how these terms are used in context in our corpus, and end up with a first dictionary of relevant keywords and key expressions.
Single Keywords (Unigrams)
First, we keep a few single keywords that are our seed words, that are highly specific (minimizing the risk of false positives), and/or that are so often used on their own that a bigram approach is impossible. For these single keywords, an exclusion step is added later to minimize the chance of extracting too many false positives.
* budget, budgetary (seed word)
* deficit, deficits (seed word)
* fiscal (seed word)
* austerity (highly specific)
* entitlement (highly specific)
* sequestration (highly specific)
* tax, taxes, taxation (highly specific)
* defense (often solo)
Two-Word Expressions (Bigrams)
Then we add more specific bigrams that were either found directly in the list of bigrams (e.g. federal spending) or constructed by combining expressions of interest (e.g. welfare spending).
It should be noted that bigrams containing the keywords above are not included. For example, “fiscal spending” will already be captured thanks to “fiscal”, so we do not need to add this expression to our dictionary.
Spending-related terms:
* federal spending
* government spending
* governmental spending
* military spending
* welfare spending
* social spending
* state spending
* spending cuts
* spending programs
Expenditure-related terms:
* benefits expenditure
* federal expenditure
* government expenditure
* governmental expenditure
* military expenditure
* social expenditure
* state expenditure
* welfare expenditure
Revenue-related terms:
* benefits revenue
* federal revenue
* government revenue
* governmental revenue
* military revenue
* social revenue
* state revenue
* welfare revenue
Cutback-related terms:
* benefits cutback
* federal cutback
* government cutback
* governmental cutback
* military cutback
* social cutback
* state cutback
* welfare cutback
Outlay-related terms:
* benefits outlays
* federal outlays
* government outlays
* governmental outlays
* military outlays
* social outlays
* state outlays
* welfare outlays
Policy-related terms:
* policy stimulus
* stimulus package
* reform package
Benefits-related terms:
* social benefits
* benefits payments
* benefits transfers
* benefits increase
* benefits reform
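The expenditure, revenue, cutback and outlay expressions listed above follow a modifier-plus-head pattern, and expressions already covered by a single keyword are skipped. A minimal sketch of that construction is given below; the function name `build_bigrams` and the split into `MODIFIERS` and `HEADS` are assumptions made for illustration, read off the lists above rather than taken from the original code.

```python
from itertools import product

# Single keywords kept in the dictionary (from the unigram list above).
SINGLE_KEYWORDS = {
    "budget", "budgetary", "deficit", "deficits", "fiscal", "austerity",
    "entitlement", "sequestration", "tax", "taxes", "taxation", "defense",
}

MODIFIERS = ["benefits", "federal", "government", "governmental",
             "military", "social", "state", "welfare"]
HEADS = ["expenditure", "revenue", "cutback", "outlays"]

def build_bigrams(modifiers, heads, single_keywords):
    """Combine modifiers and heads, skipping pairs a single keyword already covers."""
    out = []
    for head, mod in product(heads, modifiers):
        if mod in single_keywords or head in single_keywords:
            continue  # e.g. "fiscal spending" is already matched by "fiscal"
        out.append(f"{mod} {head}")
    return out

candidates = build_bigrams(MODIFIERS, HEADS, SINGLE_KEYWORDS)
```

Constructed expressions of this kind are only candidates: some of them may never occur in the corpus, which is why they are checked against the text afterwards.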
To validate that these terms are indeed relevant, we randomly select five sentences per expression from our corpus and extract them with context (one sentence before, one sentence after):
Five Random Examples per Term
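The sampling of sentences in context can be sketched as follows. The function name `sample_in_context`, the fixed random seed and the example sentences are assumptions made for illustration; the sketch also assumes the sentences form one continuous text, so it does not stop the context window at speech boundaries.

```python
import random
import re

def sample_in_context(sentences, term, n=5, seed=0):
    """Pick up to n random sentences containing `term`, each with the
    sentence before and the sentence after it."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = [i for i, s in enumerate(sentences) if pattern.search(s)]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = sorted(rng.sample(hits, min(n, len(hits))))
    return [" ".join(sentences[max(i - 1, 0): i + 2]) for i in chosen]

# Invented sentences, not taken from the corpus.
sentences = [
    "Growth slowed.",
    "Federal spending rose sharply.",
    "Inflation stayed low.",
    "Nothing else changed.",
]

excerpts = sample_in_context(sentences, "federal spending")
no_match = sample_in_context(sentences, "welfare cutback")
```

An expression that returns an empty list here has no match in the text, which is the signal used to drop it from the dictionary.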
Some of these bigrams have no match in the corpus (or almost none), and are therefore removed from our final inclusion dictionary:
* benefits cutback
* benefits expenditure
* benefits outlays
* benefits payments (because it is only there twice and captured by other terms)
* benefits reform
* benefits revenue
* benefits spending
* benefits transfers
* governmental cutback
* military cutback
* social cutback
* social outlays
* social revenue
* welfare cutback
* welfare revenue
Therefore, our final dictionary contains 43 terms, including 11 single words and 32 bigrams, as listed below.
* benefits increase
* federal cutback
* federal expenditure
* federal outlays
* federal revenue
* federal spending
* government cutback
* government expenditure
* government outlays
* government revenue
* government spending
* governmental expenditure
* governmental outlays
* governmental revenue
* governmental spending
* military expenditure
* military outlays
* military spending
* policy stimulus
* reform package
* social benefits
* social expenditure
* social spending
* spending cuts
* spending programs
* state expenditure
* state spending
* stimulus package
* welfare expenditure
* welfare outlays
* welfare spending
* austerity
* budget
* budgetary
* defense
* deficit
* entitlement
* fiscal
* sequestration
* tax
* taxation
* taxes
1.4 Creating an exclusion list for single keywords
We take a final step to reduce the chance of getting false positives. For our 11 single keywords, we extract all the bigrams that include these terms and that appear at least 100 times in our corpus.
bigram frequency
1 fiscal policy 4411
2 federal budget 1685
3 budget deficit 1576
4 monetary fiscal 1390
5 trade deficit 1278
6 budget deficits 1129
7 account deficit 1026
8 fiscal stimulus 1009
9 fiscal monetary 932
10 fiscal policies 857
11 fiscal year 701
12 federal deficit 618
13 tax cut 587
14 tax cuts 585
15 congressional budget 578
16 income tax 511
17 budget office 501
18 tax rates 473
19 defense spending 378
20 tax credit 376
21 federal deficits 360
22 tax revenues 354
23 deficit reduction 350
24 fiscal cliff 347
25 trade deficits 343
26 account deficits 332
27 tax reform 317
28 fiscal restraint 299
29 fiscal authorities 291
30 tax increases 270
31 investment tax 257
32 policy fiscal 250
33 fiscal deficits 240
34 tax system 232
35 fiscal actions 226
36 tax rate 215
37 government deficits 214
38 fiscal situation 206
39 tax code 203
40 deficit $ 202
41 deficit spending 202
42 u.s fiscal 202
43 fiscal discipline 201
44 fiscal drag 201
45 tax reduction 201
46 income taxes 197
47 unified budget 193
48 balanced budget 185
49 payroll tax 185
50 fiscal side 182
51 tax policy 181
52 government budget 170
53 current fiscal 166
54 billion fiscal 165
55 fiscal deficit 162
56 tax changes 160
57 tax reductions 160
58 billion deficit 159
59 higher taxes 159
60 federal fiscal 157
61 fiscal support 157
62 large deficits 157
63 large budget 156
64 tax receipts 155
65 fiscal regulatory 153
66 line defense 153
67 tax incentives 151
68 deficit financing 145
69 tax revenue 145
70 tax spending 143
71 fiscal problems 141
72 fiscal policymakers 138
73 fiscal sustainability 137
74 federal tax 136
75 fiscal package 130
76 budget surplus 129
77 budget process 124
78 strong fiscal 124
79 tax base 120
80 tax increase 120
81 tax structure 120
82 fiscal imbalances 118
83 fiscal consolidation 117
84 corporate tax 114
85 deficit fiscal 114
86 changes tax 113
87 government deficit 112
88 tax treatment 112
89 entitlement programs 111
90 tax laws 111
91 tax burden 109
92 budget surpluses 108
93 effects fiscal 107
94 nondefense capital 106
95 tax credits 106
96 tax law 106
97 reduce deficit 104
98 payroll taxes 103
99 fiscal challenges 102
100 fiscal years 101
101 tax policies 101
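The frequency filter behind the table above can be sketched as follows. The function name `frequent_keyword_bigrams` is hypothetical, and matching on whole tokens is an assumption. Three of the counts are copied from the table; the two rows marked as made up are there only to show what gets filtered out.

```python
# The 11 single keywords of the final dictionary.
SINGLE_KEYWORDS = {
    "austerity", "budget", "budgetary", "defense", "deficit", "entitlement",
    "fiscal", "sequestration", "tax", "taxation", "taxes",
}

def frequent_keyword_bigrams(bigram_counts, keywords, min_count=100):
    """Bigrams that contain a keyword (as a whole token) and occur at
    least `min_count` times, sorted by decreasing frequency."""
    rows = [
        (bg, n) for bg, n in bigram_counts.items()
        if n >= min_count and any(tok in keywords for tok in bg.split())
    ]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

bigram_counts = {
    "fiscal policy": 4411,   # from the table above
    "trade deficit": 1278,   # from the table above
    "fiscal years": 101,     # from the table above
    "interest rates": 5000,  # made up: frequent, but contains no keyword
    "tax haven": 12,         # made up: contains a keyword, but too rare
}

table = frequent_keyword_bigrams(bigram_counts, SINGLE_KEYWORDS)
```

A whole-token match would not return a row such as "nondefense capital", so the table above may have been built with a looser, substring-style match; that detail is left open here.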
To help reach a decision, we also look at five random examples of these bigrams in context:
Examples of High-Frequency Bigrams
After manual inspection, we decide to exclude:
* trade deficits, trade deficit
* account deficits, account deficit
* fiscal years, fiscal year
Of course, these bigrams are excluded only from our dictionary, not from the corpus. This means that we may still read them in excerpts, provided that these excerpts also include another term of interest from the dictionary above.
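One way to apply the exclusion list is to mask the excluded bigrams before looking for dictionary terms, so that an excerpt is kept only if another term of interest remains. The sketch below illustrates this under stated assumptions: the function names are hypothetical, `INCLUDE` is a small subset of the dictionary, and whole-word, case-insensitive matching is a choice made for the example.

```python
import re

# Small subset of the inclusion dictionary, for illustration only.
INCLUDE = ["fiscal", "deficit", "budget", "government spending"]
EXCLUDE = ["trade deficits", "trade deficit", "account deficits",
           "account deficit", "fiscal years", "fiscal year"]

def _pattern(terms):
    """Whole-word, case-insensitive alternation; longest terms tried first."""
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

INCLUDE_RE = _pattern(INCLUDE)
EXCLUDE_RE = _pattern(EXCLUDE)

def is_fiscal(sentence):
    """True if a dictionary term remains once excluded bigrams are masked."""
    masked = EXCLUDE_RE.sub(" ", sentence)
    return INCLUDE_RE.search(masked) is not None

only_excluded = is_fiscal("The trade deficit widened.")
with_other_term = is_fiscal("The trade deficit widened as the budget deficit grew.")
```

Here the first sentence is rejected because its only match sits inside an excluded bigram, while the second is kept because "budget deficit" is still matched after masking.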