SUMMARY

This is a continuation of BIAS AND CONTEXT IN PRESIDENTIAL DEBATE TEXTS, which focused on a “Bag of Words” approach to analyzing the text of presidential debates.

This analysis presents a “heat map” of frequent words. It is not really a new analysis, just a better way of visualizing the data. I also

DATA SOURCES AND METHODS

The text of the presidential debates was downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files.

CANDIDATE WORD FREQUENCIES

We can check word frequency directly by tokenizing the text and counting single words. (Note: this partially duplicates the work done in the first analysis, but because the word vector analysis below uses some of its output, it is reproduced here in a slightly different format as a quality check.)
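
The tokenize-and-count step can be sketched as follows. This is a minimal illustration, not the code used for the numbers in this analysis: the tokenization rule (lowercase runs of letters, with straight apostrophes allowed inside a word) and the sample sentence are assumptions, and curly apostrophes from a word processor would need to be normalized first.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    # Assumed rule: a token is a run of letters, optionally with
    # internal straight apostrophes (e.g. "don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def count_words(text):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokenize(text))


# Hypothetical snippet standing in for one candidate's transcript text.
sample = "We need jobs. The people need jobs, and the people know it."
counts = count_words(sample)
print(counts.most_common(3))
```

Running `count_words` once per candidate gives one column of the table of counts; adding the Counters together gives the "all" column.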

There are a total of 4853 words in the combined vocabulary of the candidates.

word       trump  sanders  clinton  rubio  cruz    all
people       105      162      117     86    24    494
going         84       85       90     56    13    328
think         31      111      150     10    13    315
country       64      119       57     52    10    302
know          47       55      115     37    41    295
will          43       48       70     45    55    261
well          26       58      104     23    26    237
need          23       72       73     15    27    210
president     10       25       75     68    30    208
now           32       40       48     53    27    200
just          44       45       47     28    13    177
tax            9       15        8     47    51    130
obama          4        5       26     10    37     82
SUM         4224     9306     9742   4607  5033  32912

- Hillary Clinton spoke 9742 total words, with a vocabulary of 2477.
- Bernie Sanders spoke 9306 total words, with a vocabulary of 2109.
- Donald Trump spoke 4224 total words, with a vocabulary of 1206.
- Ted Cruz spoke 4607 total words, with a vocabulary of 1688.
- Marco Rubio spoke 5033 total words, with a vocabulary of 1583.
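
Totals and vocabulary sizes like those above fall out of the token lists directly. The sketch below uses two made-up token lists (the names "a" and "b" are placeholders, not candidates) to show the calculation. It also shows why the combined vocabulary (4853) is smaller than the sum of the individual vocabularies (9063): shared words are counted only once.

```python
from collections import Counter

# Hypothetical token lists; in practice these would come from
# tokenizing each candidate's transcript text.
tokens_by_candidate = {
    "a": ["the", "people", "need", "the", "jobs"],
    "b": ["people", "know", "the", "tax", "tax", "plan"],
}


def speech_stats(tokens):
    """Return (total words spoken, number of distinct words)."""
    counts = Counter(tokens)
    return sum(counts.values()), len(counts)


stats = {name: speech_stats(toks) for name, toks in tokens_by_candidate.items()}

# The combined vocabulary is the union of the individual vocabularies.
combined_vocabulary = set().union(*tokens_by_candidate.values())

for name, (total, vocabulary) in stats.items():
    print(name, total, vocabulary)
print("combined vocabulary:", len(combined_vocabulary))
```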

A “heat map” of frequent words shows several interesting patterns. For instance, all candidates but one use the word “people” with high frequency. Conversely, only one candidate mentions the word “tax” frequently.
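
A heat map of this kind can be drawn from the count table with a few lines of matplotlib. The sketch below is one possible way to do it, not necessarily how the original figure was produced: it uses raw counts from four rows of the table, and the color map, the output file name, and the function name `plot_heat_map` are all assumptions.

```python
# Counts copied from four rows of the table above (columns in table order).
candidates = ["trump", "sanders", "clinton", "rubio", "cruz"]
counts = {
    "people": [105, 162, 117, 86, 24],
    "think": [31, 111, 150, 10, 13],
    "president": [10, 25, 75, 68, 30],
    "tax": [9, 15, 8, 47, 51],
}
words = list(counts)
matrix = [counts[w] for w in words]  # one row per word, one column per candidate


def plot_heat_map(matrix, words, candidates, path="heatmap.png"):
    """Render the count matrix as a heat map image (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # draw off-screen so no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="hot", aspect="auto")
    ax.set_xticks(range(len(candidates)))
    ax.set_xticklabels(candidates)
    ax.set_yticks(range(len(words)))
    ax.set_yticklabels(words)
    fig.colorbar(image, ax=ax, label="word count")
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_heat_map(matrix, words, candidates)` writes the image to `heatmap.png`. Because candidates spoke very different total amounts, dividing each column by that candidate's total word count before plotting may give a fairer comparison.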

NORMALIZED WORD FREQUENCIES

Word frequencies convey differences from one candidate to the next. The graph shows the “top” words used by all candidates, normalized by word count, \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\), and the sum runs over all \(N\) words in the vocabulary. The \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.
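
As a worked example of the normalization, the sketch below applies \(\nu_{i} = W_{i} / \sum_{k} W_{k}\) to three entries from the trump column of the count table. The function name `normalize` is mine; the total of 4224 is the SUM row of the table, which covers all words spoken and not only the three listed here.

```python
def normalize(counts, total=None):
    """Return nu_i = W_i / sum_k W_k for each word i.

    If `total` is omitted, the sum is taken over the counts supplied.
    """
    if total is None:
        total = sum(counts.values())
    return {word: w / total for word, w in counts.items()}


# Three entries from the trump column; 4224 is that column's SUM row.
trump_counts = {"people": 105, "going": 84, "country": 64}
trump_nu = normalize(trump_counts, total=4224)
print(round(trump_nu["people"], 4))  # 0.0249
```

In other words, roughly 2.5% of the words in that column's total are "people".
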
An interesting way to look at the differences in word frequencies is to use Zipf’s law to compare frequencies of words both in the overall vocabulary of the debates and in the individual candidate responses. Zipf’s law states that the frequency \(\nu_{i}\) of a word is inversely proportional to its rank.

In the graph below, the overall behavior (taking all the candidate speech together) follows this law fairly well. What’s interesting is to plot the speech of the individual candidates alongside it. Zipf’s law provides a “baseline” for vocabulary usage. Since many of the words used are the same, it is the deviation from the baseline that will provide insights into different interpretations of speech in a “bag of words” model.
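
The comparison against a Zipf baseline can be sketched as below. This is an illustrative reconstruction under my own assumptions, not the code behind the graph: the baseline is taken to be \(\nu_{1}/r\) anchored at the most frequent word, the exponent is estimated by an ordinary least-squares fit in log-log space, and the example counts are synthetic values chosen to follow the law exactly.

```python
import math


def rank_frequency(counts):
    """Return (rank, normalized frequency) pairs, most frequent word first."""
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, w / total) for rank, w in enumerate(ordered, start=1)]


def zipf_exponent(pairs):
    """Negated least-squares slope of log(frequency) against log(rank).

    A value near 1 means frequency is inversely proportional to rank.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


def zipf_deviation(pairs):
    """Ratio of each observed frequency to the baseline nu_1 / rank."""
    top = pairs[0][1]
    return [(rank, freq / (top / rank)) for rank, freq in pairs]


# Synthetic counts that follow Zipf's law exactly (60, 60/2, 60/3, 60/4).
ideal = {"a": 60, "b": 30, "c": 20, "d": 15}
pairs = rank_frequency(ideal)
print(zipf_exponent(pairs))
print(zipf_deviation(pairs))
```

For a real candidate, deviation ratios well above or below 1 at a given rank mark the words used more or less often than the baseline would predict.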

CONCLUSIONS

Word choices vary from candidate to candidate. While the overall speech follows expected linguistic behavior, the individual candidates’ usage varies remarkably. This provides some basis for believing that a “bag of words” approach can provide at least some insight into candidate positions and biases. The differences appear to be most pronounced at higher-ranking words, suggesting this might be a place to look for greater subtlety in sentiment.