SUMMARY

Perception is influenced, in part, by how frequently we hear about an issue and by the context in which we hear it. This small study is an experiment to see whether systematic differences in language, reflecting political priorities and biases, can be detected. Using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the texts of recent Republican and Democratic presidential debates. Key findings are:
1. “Wordcloud” visualization reveals stylistic differences between candidates but gives no clarity on specific positions.
2. Word frequencies of selected “key words” suggest differences in position. A z-statistic and a coefficient of variation can be used to highlight significant differences between candidates.
3. Initial results for bigram tokenization reveal some differences in key-word context.

DATA SOURCES AND METHODS

The text of the presidential debates was downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point on, all processing was done in R using the capabilities of {tm} and associated libraries.

CANDIDATE WORD-CLOUDS

Wordclouds are a quick and visually appealing way to compare texts. The {wordcloud} package in R is used here. Not surprisingly, word choices vary between candidates. However, there are also some striking and surprising similarities.

Let’s first compare the word clouds of pairs of candidates.

TRUMP V. SANDERS

Bernie’s word cloud is larger than Donald’s because he spoke more words in total. (There were three major candidates at the Democratic debate and ten at the Republican debate.) What I find most surprising is the similarity of the clouds; words like “people”, “country”, and “going” are common to both. Despite the strong policy differences between the two candidates, the word clouds reveal little about them.

c_wordcloud(trump_all)

c_wordcloud(sanders_all)

HILLARY V. CARLY

In this case the word clouds couldn’t be more different. Hillary’s emphasis on “think” and “people” differs remarkably from Carly’s emphasis on “government”.

c_wordcloud(clinton_all)

c_wordcloud(fiorina_all)

CRUZ V. RUBIO

Ted Cruz’s word cloud emphasizes terms like “taxes” and “washington”; Marco Rubio’s also emphasizes taxes.

c_wordcloud(cruz_all)

c_wordcloud(rubio_all)

STAYING ON MESSAGE: COMPARING DEBATES

We can also split the text by debate. Since the debates cover different topics and questions, one might expect this to be reflected in the text of the separate dialogues. What’s surprising here is how similar each candidate’s language is from one debate to the next.

c_wordcloud(candidate_text_tc("TRUMP", r_oct))

c_wordcloud(candidate_text_tc("TRUMP", r_nov))

c_wordcloud(candidate_text_tc("SANDERS", d_oct))

c_wordcloud(candidate_text_tc("SANDERS", d_nov))
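
The helper candidate_text_tc() is not shown in this write-up. As a rough illustration of how one speaker's text might be pulled from a debate transcript, here is a minimal base-R sketch. The function name extract_speaker_text and the transcript layout (each turn starting with an upper-case label such as "TRUMP:", with unlabeled paragraphs continuing the previous turn) are assumptions for illustration, not the code used here.

```r
# Hypothetical helper, NOT the candidate_text_tc() used in this study.
# Assumes each turn begins with an upper-case speaker label followed by a
# colon, and that unlabeled paragraphs continue the previous speaker's turn.
extract_speaker_text <- function(lines, speaker) {
  lines <- lines[nzchar(trimws(lines))]              # drop blank lines
  label_pattern <- "^[A-Z][A-Z .'-]*:"
  has_label <- grepl(label_pattern, lines)
  labels <- sub(":.*$", "", lines[has_label])        # label text before the colon
  # Carry the most recent label forward over unlabeled paragraphs
  current <- c(NA_character_, labels)[cumsum(has_label) + 1]
  # Strip the label itself from the spoken text
  text <- sub(paste0(label_pattern, "[[:space:]]*"), "", lines)
  paste(text[!is.na(current) & current == speaker], collapse = " ")
}

# Made-up transcript lines, for illustration only
demo_lines <- c(
  "TRUMP: We will build a wall.",
  "",
  "It will be great.",
  "SANDERS: Wall Street is the problem.",
  "TRUMP: They're coming."
)
trump_demo <- extract_speaker_text(demo_lines, "TRUMP")
```

Applying the same function to the transcript of each debate separately would give the per-debate texts compared above.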

WORD FREQUENCY

We can check word frequency directly by tokenizing the text and counting single words.
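
As a minimal sketch of this step, the following uses base R only, whereas the study itself used {tm}. The function name count_words and the tiny stop-word list are assumptions made for illustration.

```r
# Illustrative base-R word counter (the study used {tm}; this is a stand-in).
count_words <- function(text, stop_words = character(0)) {
  text <- tolower(paste(text, collapse = " "))
  # Remove everything except letters and whitespace, so "they're" -> "theyre"
  text <- gsub("[^a-z[:space:]]", "", text)
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  sort(table(tokens), decreasing = TRUE)            # most frequent first
}

# Made-up sentence and stop-word list, for illustration only
freq <- count_words(
  "They're going to build a wall. The wall is going up.",
  stop_words = c("a", "the", "is", "to")
)
```

Because punctuation is stripped before splitting, contractions collapse into single tokens such as “theyre”, which is the form that appears in the key-word list later on.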

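Here are the five most frequent words used by each candidate, combined into a single table. The SUM column is the total count of each word across candidates; the SUM row is each candidate’s total word count.

There are a total of 3030 distinct words in the combined vocabulary of the candidates.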

word        trump  sanders  clinton  fiorina  rubio    SUM
people         33       85       53       10     36    217
think           9       55       90        9      4    167
going          44       44       45       10     19    162
country        34       70       25        1     17    147
know           23       26       56       19     21    145
well            9       31       56        8     11    115
will           23       25       32       17     17    114
need            5       33       36       18      6     98
tax            12       11        4       17     28     72
government      0        7        6       40     11     64
every           4       15        9       26      8     62
SUM          1685     4314     4618     1580   2163  14360

Word counts differ widely. For instance, Carly Fiorina said “government” a total of 40 times in her two debates, while Donald Trump didn’t say it at all. Bernie Sanders and Hillary Clinton said “think” 145 times between them, while the three Republican candidates said it only 22 times among them.
The total number of words spoken by Carly Fiorina was 1580 and her vocabulary of distinct words was 702. By comparison, Bernie Sanders said 4314 total words, with a vocabulary of 1375 words.

NORMALIZED WORD FREQUENCIES

From the above, there appears to be information in comparing how often one candidate uses a word with how often another does. Here is a graph of the “top” words used by all candidates, normalized by each candidate’s total word count: \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\), \(W_{i}\) is its count, and \(N\) is the number of distinct words the candidate used.

In the graph below the \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.

This is much more informative. For instance, the word “government” accounts for more than two percent of Carly Fiorina’s words, whereas Donald Trump doesn’t use it at all. Notice that both Bernie Sanders and Donald Trump use the word “wall” significantly more than their competitors, while Bernie Sanders alone uses the word “street” with comparably high frequency. We’ll revisit this below.
Many of the most frequent words convey little information about candidate positions. As with the wordcloud analysis, they convey mostly style.
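
The normalization can be checked against the counts in the word-frequency table. The sketch below takes three rows of that table and divides each candidate’s counts by that candidate’s total word count (the SUM row); the variable names are illustrative.

```r
# Counts for three words, copied from the word-frequency table above
counts <- rbind(
  people     = c(trump = 33, sanders = 85, clinton = 53, fiorina = 10, rubio = 36),
  government = c(trump = 0,  sanders = 7,  clinton = 6,  fiorina = 40, rubio = 11),
  tax        = c(trump = 12, sanders = 11, clinton = 4,  fiorina = 17, rubio = 28)
)

# Total words spoken by each candidate (SUM row of the same table)
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580, rubio = 2163)

# nu[i, c] = W_i / sum_k W_k for candidate c: divide every column by its total
nu <- sweep(counts, 2, totals, "/")

# Express as percentages for readability
round(100 * nu, 2)
```

For “government”, Fiorina’s entry is 40 / 1580, or roughly 2.5 percent of her words, while Trump’s is zero.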

COEFFICIENT OF VARIATION

To highlight differences between candidates we can look at the standard deviation of the word frequencies across candidates, normalized by their mean value: the coefficient of variation.

The coefficient of variation is \(c_v = \sigma/\mu\), where \(\sigma\) is the standard deviation and \(\mu\) is the mean of a word’s normalized frequency across candidates. The words with the highest \(c_v\) stand out; these include “government”, “street” and others identified above.
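
A minimal sketch of this calculation, using three rows from the word-frequency table. Note that R’s sd() is the sample standard deviation (n - 1 denominator); whether the original analysis used the sample or population form is not stated, so that choice is an assumption here.

```r
# Counts for three words and per-candidate totals, from the table above
counts <- rbind(
  people     = c(trump = 33, sanders = 85, clinton = 53, fiorina = 10, rubio = 36),
  government = c(trump = 0,  sanders = 7,  clinton = 6,  fiorina = 40, rubio = 11),
  tax        = c(trump = 12, sanders = 11, clinton = 4,  fiorina = 17, rubio = 28)
)
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580, rubio = 2163)

# Normalized frequencies: each column divided by that candidate's total
nu <- sweep(counts, 2, totals, "/")

# Coefficient of variation per word: sd across candidates over mean across candidates
cv <- apply(nu, 1, function(x) sd(x) / mean(x))

# Words ordered from most to least variable across candidates
sort(cv, decreasing = TRUE)
```

Of these three words, “government” has the largest coefficient of variation and “people” the smallest, which matches the intuition that “people” is a filler word used at similar rates by everyone.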

KEYWORDS

One way to address the problem of “filler” words is to select specific “key words” relevant to the topics of interest. The list below combines some hand-selected words with words that have a high coefficient of variation.

key_words <- c("tax", "government", "climate", "class", "wall", "street", "terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence", "theyre", "going", "major")

An apparent problem is that many of the words of interest have fairly low frequencies. To better distinguish significant differences, we can calculate a simple \(z\) statistic for each word: subtract the mean of its frequencies across candidates from each candidate’s frequency and divide by the standard deviation.

This approach highlights some fairly interesting differences. For instance:
- Carly Fiorina’s use of the word “government” lies almost two standard deviations above the mean across candidates.
- “tax” is used significantly more by Republicans than by Democrats, as is the word “money”.
- Bernie Sanders is the top user of issue words such as “health”, “gun”, “economy”, “veteran”, and many others.
- “women” is mentioned by all candidates except Donald Trump.
- “wall” is mentioned significantly more by Donald Trump and Bernie Sanders than by Hillary Clinton or Carly Fiorina.
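
The \(z\) statistic can be sketched for a single word using the “government” row of the word-frequency table. This assumes the statistic is computed on normalized frequencies with the sample standard deviation (R’s sd()), over the five candidates in that table; the original calculation may differ in those details.

```r
# "government" counts and total word counts, from the word-frequency table
government <- c(trump = 0, sanders = 7, clinton = 6, fiorina = 40, rubio = 11)
totals     <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580, rubio = 2163)

# Normalized frequency of "government" for each candidate
nu <- government / totals

# z statistic: distance from the cross-candidate mean in units of standard deviation
z <- (nu - mean(nu)) / sd(nu)

round(z, 2)
```

Under these assumptions Fiorina is the only candidate with a large positive \(z\) for “government”; everyone else falls below the mean.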

WORD ASSOCIATIONS FROM N-GRAM TOKENIZATION

Since word frequency alone does not convey context, let’s look at word associations to see if we can clarify intent and context.
To do this, let’s start with bigram tokenization of the text associated with some of the issue key words. Using the {RWeka} package we can create tables of bi- and tri-grams, which can then be searched using standard regular expressions.

bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE)]
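
For readers without {RWeka}, the idea can be sketched in base R. This is a stand-in, not the tokenizer used in this study: the function name bigram_counts is hypothetical, and the sketch ignores sentence boundaries, so the last word of one sentence is paired with the first word of the next.

```r
# Illustrative base-R bigram counter (the study used {RWeka}; this is a stand-in).
bigram_counts <- function(text) {
  text <- tolower(paste(text, collapse = " "))
  text <- gsub("[^a-z[:space:]]", "", text)          # keep letters and whitespace only
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(tokens) < 2) return(table(character(0)))
  # Pair each token with the one that follows it
  bigrams <- paste(head(tokens, -1), tail(tokens, -1))
  table(bigrams)
}

# Made-up text, for illustration only
demo_bigrams <- bigram_counts(
  "Wall Street banks. We will build a wall. Wall Street again."
)

# Search the bigram table for a key word, as in the line above
wall_bigrams <- demo_bigrams[grep("wall", names(demo_bigrams), ignore.case = TRUE)]
```

Here the table is a one-dimensional table of counts, so names() is used to reach the bigram labels.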

“TAX” IN CONTEXT

The word “tax” is heavily used by all the Republican candidates, and the context is almost identical in all cases. Marco Rubio is the most prolific user of the word. Many of the pairings he uses, for example “tax something”, “tax someone”, “tax everyone”, and “tax money”, paint a decidedly more aggressive image than the words Carly Fiorina associates with “tax”, which focus mostly on policy terms.

The words Donald Trump pairs with “tax” are similarly focused mostly on policy terms.

“WALL” IN CONTEXT

The word “wall” is used frequently by both Bernie Sanders and Donald Trump. We can clarify the context by looking at bigrams. In this case it’s clear Bernie Sanders is referring exclusively to “wall street” while Donald Trump mostly refers to his proposal to build border walls.

“THEYRE” IN CONTEXT

Donald Trump uses the word “theyre” significantly more than other candidates. The context, as revealed by bigrams, sounds like the script of a zombie movie: “theyre going”, “theyre south”, “theyre feeding”, and “theyre coming”. The language hints at a sentiment that, whoever “they” are, they’re a threat.

NOTE: After this work was completed, the New York Times published a story on the linguistic style of Donald Trump with similar conclusions. Their study included both the words “we” and “they” (“we” is suppressed here as a stop word) and covered a much larger amount of text.

CONCLUSIONS

Word clouds provide insight into differences in style but do not distinguish well between candidate positions. Surprisingly, opposing candidates can have very similar word clouds.
Looking at the most frequent words provides limited insight into differences between candidate positions, since many frequently used words provide no information of interest.
By looking at the coefficient of variation and selecting key words, we can highlight the differences in candidate usage that are of greater interest.
Bigrams provide key differences in context and begin to hint at sentiment.

NEXT STEPS

My next step is to expand the text volume by adding more debate text. Since the data suggest candidate speech is largely consistent from debate to debate, it might also be beneficial to include speech transcripts if these can be found easily online.
Another avenue is to use pre-defined word vectors to coax similarities from the texts. This might help narrow the