HEAT MAPS & LINGUISTICS OF DEBATE SPEECH in R

SUMMARY

This is a continuation of BIAS AND CONTEXT IN PRESIDENTIAL DEBATE TEXTS, which focused on a “Bag of Words” approach to analyzing the text of the presidential debates.

This analysis shows a “Heat Map” of frequent words. It is not really a new analysis, just a better way of visualizing the data. I also compare each candidate’s word frequencies against Zipf’s law.

DATA SOURCES AND METHODS

The text of the presidential debates was downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files.

CANDIDATE WORD FREQUENCIES

We can check word frequency directly by tokenizing the text and counting single words. (Note: this partially duplicates the work done in the first analysis. Because the word vector analysis below uses some of its output, it is reproduced here in a slightly different format as a quality check.)
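
The tokenize-and-count step can be sketched in base R as follows. This is a minimal illustration, not the code used for the analysis: the helper name `count_words`, the splitting rule, and the sample sentence are all assumptions, and real transcripts would likely need extra handling (curly apostrophes, speaker labels, stop words).

```r
# Hypothetical helper: lower-case the text, split on anything that is not a
# letter or a straight apostrophe, and count the resulting single-word tokens.
count_words <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]         # drop empty strings left by the split
  sort(table(tokens), decreasing = TRUE)   # named counts, most frequent first
}

# Toy input standing in for one candidate's transcript (not real debate text).
sample_text <- "The people know what the people need. People will vote."
freq <- count_words(sample_text)
print(freq)

length(freq)   # vocabulary size (number of distinct words)
sum(freq)      # total number of words spoken
```

The two quantities at the end correspond to the "vocabulary" and "total words" figures reported for each candidate.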

There are a total of 4853 words in the combined vocabulary of the candidates.

word	trump	sanders	clinton	rubio	cruz	all
people	105	162	117	86	24	494
going	84	85	90	56	13	328
think	31	111	150	10	13	315
country	64	119	57	52	10	302
know	47	55	115	37	41	295
will	43	48	70	45	55	261
well	26	58	104	23	26	237
need	23	72	73	15	27	210
president	10	25	75	68	30	208
now	32	40	48	53	27	200
just	44	45	47	28	13	177
tax	9	15	8	47	51	130
obama	4	5	26	10	37	82
SUM	4224	9306	9742	4607	5033	32912

- Hillary Clinton spoke a total of 9742 words, with a vocabulary of 2477.
- Bernie Sanders spoke 9306 words, with a vocabulary of 2109.
- Donald Trump spoke 4224 words, with a vocabulary of 1206.
- Ted Cruz spoke 4607 words, with a vocabulary of 1688.
- Marco Rubio spoke 5033 words, with a vocabulary of 1583.

A “heat map” of frequent words shows several interesting patterns. For instance, all candidates but one use the word “people” with high frequency. Conversely, only one candidate mentions the word “tax” frequently.
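
A heat map of this kind can be drawn with base R's `heatmap()`. The sketch below uses a few rows copied from the table above. The choice of rows, the color palette, and the decision to plot raw counts without clustering or scaling are assumptions made for illustration, not necessarily how the original figure was produced.

```r
# Counts copied from the table above (rows: words, columns: candidates).
counts <- matrix(
  c(105, 162, 117, 86, 24,
     31, 111, 150, 10, 13,
     10,  25,  75, 68, 30,
      9,  15,   8, 47, 51,
      4,   5,  26, 10, 37),
  ncol = 5, byrow = TRUE,
  dimnames = list(
    c("people", "think", "president", "tax", "obama"),
    c("trump", "sanders", "clinton", "rubio", "cruz")
  )
)

# Rowv = NA and Colv = NA switch off the dendrograms and reordering;
# scale = "none" keeps the raw counts instead of row-wise z-scores.
heatmap(counts, Rowv = NA, Colv = NA, scale = "none",
        col = heat.colors(16), margins = c(6, 8))
```

Because raw counts favor the candidates who spoke the most, dividing each column by that candidate's total word count before plotting would make the columns more directly comparable.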

NORMALIZED WORD FREQUENCIES

Word frequencies convey differences from one candidate to the next. The graph shows the “top” words used by all candidates, normalized by word count, \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\), and \(N\) is the number of distinct words. The \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.
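
As a small worked example of the normalization, the sketch below applies \(\nu_{i} = W_{i} / \sum_{k} W_{k}\) first to a toy count vector and then to the “people” row of the table above. The function name `normalize_counts` and the toy counts are hypothetical. The candidate figures are taken from the table, with the SUM row used as the denominator because only the top words are listed.

```r
# Normalized frequency nu_i = W_i / sum_k W_k for a vector of word counts.
normalize_counts <- function(w) w / sum(w)

# Toy counts for a four-word vocabulary (illustrative, not debate data).
w <- c(people = 6, country = 3, tax = 2, obama = 1)
nu <- normalize_counts(w)
print(round(nu, 3))
sum(nu)   # normalized frequencies sum to 1

# "people" counts and total words spoken, copied from the table above.
people <- c(trump = 105, sanders = 162, clinton = 117)
totals <- c(trump = 4224, sanders = 9306, clinton = 9742)
nu_people <- people / totals
print(round(nu_people, 4))
```

Although Sanders has the largest raw count for “people” among these three, Trump has the largest normalized frequency, which is why the normalization changes the picture.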
An interesting way to look at the differences in word frequencies is to use Zipf’s law to compare word frequencies in the overall vocabulary of the debates with those in the individual candidates’ responses. Zipf’s law states that the frequency \(\nu_{i}\) of a word is inversely proportional to its rank.
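
In symbols, if \(r_{i}\) is the rank of word \(i\) (1 for the most frequent word), Zipf’s law says \(\nu_{i} \propto 1 / r_{i}\), so \(\log \nu_{i} = c - \log r_{i}\) for some constant \(c\): a straight line of slope \(-1\) on log-log axes. One simple way to check this, sketched below, is a least-squares fit of \(\log \nu\) against \(\log r\). The helper name `zipf_slope` and the use of `lm()` are assumptions for illustration, not necessarily the method behind the graph.

```r
# Hypothetical helper: rank words by count and fit log10(nu) ~ log10(rank).
zipf_slope <- function(counts) {
  counts <- as.numeric(counts)                    # accept tables or plain vectors
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  nu <- counts / sum(counts)                      # normalized frequencies
  word_rank <- seq_along(nu)                      # 1 = most frequent word
  fit <- lm(log10(nu) ~ log10(word_rank))
  unname(coef(fit)[2])                            # fitted slope
}

# Synthetic counts that follow the law exactly (not debate data).
ideal <- 1000 / (1:50)
zipf_slope(ideal)   # close to -1 for data that obey Zipf's law
```

Applying the same function to the combined counts and to each candidate's counts gives one number per speaker that can be compared with the \(-1\) baseline.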
In the graph below, the overall behavior (taking all the candidate speech together) follows this law fairly well. What’s interesting is to plot the speech of the individual candidates alongside it. Zipf’s law provides a “baseline” for vocabulary usage. Since many of the words used are the same, it is the deviation from the baseline that will provide insights into different interpretations of speech in a “bag of words” model.

CONCLUSIONS

Word choices vary from candidate to candidate. While the overall speech follows expected linguistic behavior, the individual candidates’ usage varies remarkably. This provides some basis for believing a “bag of words” approach can provide at least some intelligence into candidate positions and biases. The differences appear to be most profound at higher-ranking words, suggesting this might be a place to look for greater subtlety in sentiment.