This is a continuation of BIAS AND CONTEXT IN PRESIDENTIAL DEBATE TEXTS, which focused on a “Bag of Words” approach to analyzing the text of presidential debates.
This analysis shows a “heat map” of frequent words. It is not really a new analysis, just a better way of visualizing the data. I also
The text of the presidential debates was downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files.
We can check word frequency directly by tokenizing the text and counting single words. (Note: this partially duplicates the work done in the first analysis. Because the word vector analysis below builds on some of that output, it is reproduced here in a slightly different format as a quality check.)
There are a total of 4853 words in the combined vocabulary of the candidates.
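The tokenize-and-count step can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the regular expression, the small stop-word set, and the toy transcripts are all assumptions made for the example, since the original tokenizer and filtering rules are not specified here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the actual list used in the analysis is not given.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "we", "i"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text, stop_words=STOP_WORDS):
    """Count single-word occurrences, skipping stop words."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)


# Toy transcripts standing in for the real .txt files (illustrative only).
speeches = {
    "candidate_a": "The people of this country need jobs. People need a plan.",
    "candidate_b": "I think the tax plan is going to hurt people.",
}

# One Counter per candidate, plus a combined Counter for the whole ensemble.
per_candidate = {name: word_counts(text) for name, text in speeches.items()}
combined = sum(per_candidate.values(), Counter())

print("combined vocabulary size:", len(combined))
print("most common words:", combined.most_common(3))
```

Here the length of a `Counter` gives the vocabulary size (distinct words), while the sum of its values gives the total number of words spoken.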
word | trump | sanders | clinton | rubio | cruz | all |
---|---|---|---|---|---|---|
people | 105 | 162 | 117 | 86 | 24 | 494 |
going | 84 | 85 | 90 | 56 | 13 | 328 |
think | 31 | 111 | 150 | 10 | 13 | 315 |
country | 64 | 119 | 57 | 52 | 10 | 302 |
know | 47 | 55 | 115 | 37 | 41 | 295 |
will | 43 | 48 | 70 | 45 | 55 | 261 |
well | 26 | 58 | 104 | 23 | 26 | 237 |
need | 23 | 72 | 73 | 15 | 27 | 210 |
president | 10 | 25 | 75 | 68 | 30 | 208 |
now | 32 | 40 | 48 | 53 | 27 | 200 |
just | 44 | 45 | 47 | 28 | 13 | 177 |
tax | 9 | 15 | 8 | 47 | 51 | 130 |
obama | 4 | 5 | 26 | 10 | 37 | 82 |
SUM | 4224 | 9306 | 9742 | 4607 | 5033 | 32912 |
- Hillary Clinton spoke a total of 9742 words, with a vocabulary of 2477.
- Bernie Sanders spoke a total of 9306 words, with a vocabulary of 2109.
- Donald Trump spoke a total of 4224 words, with a vocabulary of 1206.
- Ted Cruz spoke a total of 4607 words, with a vocabulary of 1688.
- Marco Rubio spoke a total of 5033 words, with a vocabulary of 1583.
A “heat map” of frequent words shows several interesting patterns. For instance, all candidates but one use the word “people” with high frequency. Conversely, only one candidate mentions the word “tax” frequently.
Word frequencies convey differences from one candidate to the next. This is a graph of the “top” words used by all candidates, normalized by word count, \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\), and the sum runs over all \(N\) distinct words, so the denominator is the total word count. The \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.
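As a worked example of this normalization, the sketch below applies \(\nu_{i} = W_{i} / \sum_{k} W_{k}\) to two rows of the table above, using the SUM row as each candidate's total word count. The dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
# Total words per candidate, copied from the SUM row of the table
# (columns as labeled there).
totals = {"trump": 4224, "sanders": 9306, "clinton": 9742, "rubio": 4607, "cruz": 5033}

# Raw counts W_i for two example words, copied from the same table.
counts = {
    "people": {"trump": 105, "sanders": 162, "clinton": 117, "rubio": 86, "cruz": 24},
    "tax": {"trump": 9, "sanders": 15, "clinton": 8, "rubio": 47, "cruz": 51},
}


def normalized_frequency(count, total):
    """nu_i = W_i / sum_k W_k for a single word and a single candidate."""
    return count / total


# nu[word][candidate] holds the normalized frequency.
nu = {
    word: {cand: normalized_frequency(n, totals[cand]) for cand, n in row.items()}
    for word, row in counts.items()
}

for word, row in nu.items():
    print(word, {cand: round(value, 4) for cand, value in row.items()})
```

Normalizing this way makes candidates with very different speaking times comparable, since raw counts alone favor whoever spoke the most.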
An interesting way to look at the differences in word frequencies is to use Zipf’s law to compare frequencies of words both in the overall vocabulary of the debates and in the individual candidate responses. Zipf’s law states that the frequency \(\nu_{i}\) of a word is inversely proportional to its rank when words are ordered from most to least frequent.
In the graph below, the overall behavior (taking all the candidates’ speech together) shows that this law is followed fairly well. What’s interesting is to plot the speech of the individual candidates alongside it. Zipf’s law provides a “baseline” for vocabulary usage. Since many of the words used are the same, it is the deviation from the baseline that will provide insights into different interpretations of speech in a “bag of words” model.
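One way to quantify deviation from the Zipf baseline is sketched below. It assumes the idealized form of the law with exponent 1 (frequency equal to the top word's frequency divided by rank) and uses made-up toy counts; the function names and the ratio used as a deviation measure are illustrative assumptions, not the method behind the graph.

```python
from collections import Counter


def rank_frequency(counts):
    """Return (rank, word, normalized frequency) tuples, most frequent first."""
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, n / total) for rank, (word, n) in enumerate(ranked, start=1)]


def zipf_baseline(rank, top_frequency):
    """Idealized Zipf prediction (exponent 1): top frequency divided by rank."""
    return top_frequency / rank


def deviation_from_baseline(candidate_counts, overall_counts):
    """Ratio of a candidate's normalized frequency to the Zipf baseline,
    with ranks taken from the overall (all-candidate) ordering."""
    table = rank_frequency(overall_counts)
    top_frequency = table[0][2]
    candidate_total = sum(candidate_counts.values())
    return {
        word: (candidate_counts[word] / candidate_total) / zipf_baseline(rank, top_frequency)
        for rank, word, _ in table
    }


# Toy counts (illustrative only, not taken from the debates).
overall = Counter({"people": 60, "going": 30, "think": 20, "country": 15, "know": 12})
candidate = Counter({"people": 10, "going": 10, "think": 2})

for word, ratio in deviation_from_baseline(candidate, overall).items():
    print(word, round(ratio, 2))
```

A ratio above 1 means the candidate uses the word more than the baseline predicts for its overall rank; a ratio below 1 means less, and 0 means the candidate never used it.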
Word choices vary from candidate to candidate. While the overall speech follows expected linguistic behavior, the individual candidates’ usage varies remarkably. This provides some basis for believing that a “bag of words” approach can provide at least some insight into candidate positions and biases. The differences appear to be most profound at higher-ranking words, suggesting this might be a place to look for greater subtlety in sentiment.