Perception is influenced, in part, by how frequently we hear about an issue and by the context in which we hear it. This little study is an experiment to see whether systematic differences in language that reflect political priorities and biases can be detected. Here, using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the texts of recent Republican and Democratic presidential debates. Key findings are:
1. “Wordcloud” visualization reveals stylistic differences between candidates but gives no clarity on specific positions.
2. Word frequencies of selected “key words” suggest differences in position. A z-statistic and a coefficient of variation can be used to highlight significant differences between candidates.
3. Initial results for bigram tokenization reveal some differences in key-word context.
The text of the presidential debates was downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point on, all processing was done in R using the capabilities of {tm} and associated libraries.
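As a rough illustration of the first processing step, here is a minimal base-R sketch of pulling one speaker's words out of a plain-text transcript. It assumes each turn starts with an upper-case speaker label followed by a colon; the helper name `speaker_text` and the toy lines are hypothetical and are not the functions or data used for the results in this post.

```r
# Toy transcript lines (invented; real transcripts are assumed to mark each
# turn with an upper-case speaker label followed by a colon).
transcript <- c(
  "MODERATOR: Welcome to the debate.",
  "SPEAKER_A: Thank you. We need jobs.",
  "And we need them now.",
  "SPEAKER_B: I think we need health care.",
  "SPEAKER_A: People know the economy matters."
)

# Hypothetical helper: collect everything said by one speaker.
speaker_text <- function(speaker, lines) {
  has_label <- grepl("^[A-Z_]+:", lines)
  labels <- sub(":.*$", "", lines[has_label])   # label of each new turn
  # Carry the most recent label forward over unlabeled continuation lines.
  # Lines before the first label get NA.
  turn_speaker <- c(NA, labels)[cumsum(has_label) + 1]
  body <- sub("^[A-Z_]+: *", "", lines)         # strip the label itself
  keep <- !is.na(turn_speaker) & turn_speaker == speaker
  paste(body[keep], collapse = " ")
}

speaker_a <- speaker_text("SPEAKER_A", transcript)
print(speaker_a)
```

Carrying the label forward matters because a single answer often spans several paragraphs, and only the first one carries the speaker's name.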
Wordclouds are a quick and visually appealing method to compare texts. The {wordcloud} package in R is used here. Not surprisingly, word choices vary between candidates. However, there are also some striking and surprising similarities.
Let’s first compare the word clouds of candidates using the {wordcloud} package.
Bernie’s word cloud is larger than Donald’s because he spoke more words in total. (There were three major candidates at the Democratic debate and ten at the Republican debate.) What I find most surprising is the similarity of the clouds; words like “people”, “country”, and “going” are common to both. Despite strong differences in policy, the word clouds reveal little about them.
c_wordcloud(trump_all)
c_wordcloud(sanders_all)
In this case the word clouds couldn’t be more different. Hillary’s emphasis on “think” and “people” differs remarkably from Carly’s emphasis on “government”.
c_wordcloud(clinton_all)
c_wordcloud(fiorina_all)
c_wordcloud(cruz_all)
c_wordcloud(rubio_all)
We can also split the text by specific debate. Since the debates cover different topics and questions, one might expect to see this reflected in the text of the separate dialogues. What’s surprising here is how comparable the language of each candidate is between the debates.
c_wordcloud(candidate_text_tc("TRUMP", r_oct))
c_wordcloud(candidate_text_tc("TRUMP", r_nov))
c_wordcloud(candidate_text_tc("SANDERS", d_oct))
c_wordcloud(candidate_text_tc("SANDERS", d_nov))
We can check word frequency directly by tokenizing the text and counting single words.
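The counting step can be sketched in base R as follows. This is only an illustration under stated assumptions: the two text snippets are invented, the stop-word list is a tiny hand-picked one, and `count_words` is a hypothetical helper rather than the {tm}-based pipeline used for the results here.

```r
# Toy transcript snippets (invented for illustration, not real debate text).
texts <- c(
  speaker_a = "We are going to build it. People know we are going to win.",
  speaker_b = "People need health care. I think people know that."
)

# A small, hand-picked stop-word list (an assumption; {tm} ships a longer one).
stop_words <- c("we", "are", "to", "it", "i", "that", "the", "a")

# Hypothetical helper: lower-case, strip punctuation, split, drop stop words.
count_words <- function(text, stop_words) {
  clean <- gsub("[^a-z ]", "", tolower(text))   # drop punctuation and digits
  tokens <- strsplit(clean, " +")[[1]]          # split on runs of spaces
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  table(tokens)                                 # counts per distinct word
}

counts <- lapply(texts, count_words, stop_words = stop_words)

# Combine per-speaker counts into one word-by-speaker matrix, filling gaps with 0.
vocab <- sort(unique(unlist(lapply(counts, names))))
word_matrix <- sapply(counts, function(tb) {
  out <- setNames(integer(length(vocab)), vocab)
  out[names(tb)] <- as.integer(tb)
  out
})
print(word_matrix)
```

The column sums of such a matrix give each speaker's total word count, and the number of non-zero rows in a column gives the size of that speaker's vocabulary.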
Here are the most frequent words used by the candidates in tabular form.
There are a total of 3030 words in the combined vocabulary of the candidates.
word | trump | sanders | clinton | fiorina | rubio | SUM |
---|---|---|---|---|---|---|
people | 33 | 85 | 53 | 10 | 36 | 217 |
think | 9 | 55 | 90 | 9 | 4 | 167 |
going | 44 | 44 | 45 | 10 | 19 | 162 |
country | 34 | 70 | 25 | 1 | 17 | 147 |
know | 23 | 26 | 56 | 19 | 21 | 145 |
well | 9 | 31 | 56 | 8 | 11 | 115 |
will | 23 | 25 | 32 | 17 | 17 | 114 |
need | 5 | 33 | 36 | 18 | 6 | 98 |
tax | 12 | 11 | 4 | 17 | 28 | 72 |
government | 0 | 7 | 6 | 40 | 11 | 64 |
every | 4 | 15 | 9 | 26 | 8 | 62 |
SUM | 1685 | 4314 | 4618 | 1580 | 2163 | 14360 |
Word counts differ widely. For instance, Carly Fiorina said “government” a total of 40 times in her two debates, while Donald Trump didn’t say it at all. Bernie Sanders and Hillary Clinton said “think” 145 times, while the three Republican candidates said it only 22 times among them.
The total number of words spoken by Carly Fiorina was 1580 and her vocabulary of distinct words was 702. By comparison, Bernie Sanders said 4314 total words, with a vocabulary of 1375 words.
From the above, there appears to be information in comparing how frequently one candidate uses a word with how frequently another does. Here is a graph of the “top” words used by all candidates, normalized by word count, \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\), and the sum runs over all \(N\) words in a candidate’s vocabulary.
In the graph below the \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.
This is much more informative. For instance, the word “government” makes up more than two percent of Carly Fiorina’s total word usage, whereas Donald Trump doesn’t use it at all. Notice that both Bernie Sanders and Donald Trump use the word “wall” significantly more than their competitors, while Bernie Sanders alone uses the word “street” with comparably high frequency. We’ll revisit this below.
Many of the most frequent words convey little information about candidate positions. As with the wordcloud analysis, they convey mostly style.
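The normalization defined above can be sketched in a few lines of base R. The counts and per-candidate totals below are copied from three rows and the SUM row of the table earlier in this post; the variable names are my own choices for illustration.

```r
# Counts for three words, copied from the table above.
counts <- rbind(
  people     = c(trump = 33, sanders = 85, clinton = 53, fiorina = 10, rubio = 36),
  government = c(trump = 0,  sanders = 7,  clinton = 6,  fiorina = 40, rubio = 11),
  tax        = c(trump = 12, sanders = 11, clinton = 4,  fiorina = 17, rubio = 28)
)

# Total words per candidate (the SUM row of the table).
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580, rubio = 2163)

# nu_i = W_i / sum_k W_k: divide each candidate's column by that candidate's total.
nu <- sweep(counts, 2, totals, "/")

print(round(100 * nu, 2))   # shown as a percentage of each candidate's words
```

Dividing by each candidate's own total removes the effect of speaking time, so a candidate with few turns can still be compared with one who spoke far more.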
To highlight differences between candidates we can look at the standard deviation of the word frequencies normalized to the mean value, or the Coefficient of Variation.
Words with the highest coefficient of variation \(c_v = \sigma/\mu\), where \(\sigma\) is the standard deviation and \(\mu\) is the mean value, are apparent. These include “government”, “street” and others identified above.
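As a small worked example, the coefficient of variation can be computed per word from the normalized frequencies. The counts come from the table earlier in this post; using R's `sd()` (the sample standard deviation, with an n - 1 denominator) is an assumption on my part, since the text does not say which estimator is used.

```r
# Counts for four words and per-candidate totals, copied from the table above.
counts <- rbind(
  people     = c(trump = 33, sanders = 85, clinton = 53, fiorina = 10, rubio = 36),
  think      = c(trump = 9,  sanders = 55, clinton = 90, fiorina = 9,  rubio = 4),
  government = c(trump = 0,  sanders = 7,  clinton = 6,  fiorina = 40, rubio = 11),
  tax        = c(trump = 12, sanders = 11, clinton = 4,  fiorina = 17, rubio = 28)
)
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580, rubio = 2163)

# Normalized frequency of each word for each candidate.
nu <- sweep(counts, 2, totals, "/")

# c_v = sigma / mu, taken across candidates for each word (one value per row).
# Assumption: sd() is the sample standard deviation.
cv <- apply(nu, 1, sd) / rowMeans(nu)

print(sort(round(cv, 2), decreasing = TRUE))
```

A word that every candidate uses at a similar rate gets a small value, while a word concentrated in one candidate's speech gets a large one.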
A way to address the problem of “filler” words is to select specific “key words” relevant to the topics of interest. The list below combines some “hand selected” words as well as those with a high coefficient of variation.
key_words <- c("tax", "government", "climate", "class", "wall", "street", "terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence", "theyre", "going", "major")
An apparent problem is that many of the words of interest have fairly low frequencies. To better distinguish significant differences, we can calculate a simple \(z\) statistic for each word by taking the mean and standard deviation of its frequencies across candidates.
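One way to read this is \(z = (\nu - \mu) / \sigma\), computed per word across candidates. The sketch below makes that assumption explicit and again uses counts copied from the table earlier in this post; it may differ in detail from the calculation behind the results discussed here.

```r
# Counts for three key words and per-candidate totals, copied from the table above.
counts <- rbind(
  government = c(trump = 0,  sanders = 7,  clinton = 6,  fiorina = 40, rubio = 11),
  tax        = c(trump = 12, sanders = 11, clinton = 4,  fiorina = 17, rubio = 28),
  need       = c(trump = 5,  sanders = 33, clinton = 36, fiorina = 18, rubio = 6)
)
totals <- c(trump = 1685, sanders = 4314, clinton = 4618, fiorina = 1580, rubio = 2163)

# Normalized frequency of each word for each candidate.
nu <- sweep(counts, 2, totals, "/")

# Assumed form of the statistic: for each word (row), subtract the mean over
# candidates and divide by the sample standard deviation over candidates.
mu <- rowMeans(nu)
sigma <- apply(nu, 1, sd)
z <- (nu - mu) / sigma   # vectors recycle down the rows, so row i uses mu[i], sigma[i]

print(round(z, 2))
```

With only five candidates the statistic is coarse, so it is better treated as a ranking aid than as a formal significance test.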
This approach highlights some fairly interesting differences. For instance:
- Carly Fiorina’s use of the word “government” differs by almost two standard deviations from the other candidates.
- “tax” is used significantly more by Republicans than Democrats as is the word “money”.
- Bernie Sanders is the top user of issue words like “health”, “gun”, “economy”, and “veteran” and many others.
- “women” are mentioned by all candidates except Donald Trump.
- “wall” is mentioned significantly more by Donald Trump and Bernie Sanders than by Hillary Clinton or Carly Fiorina.
Since word frequency alone does not convey context, let’s look at word associations to see if we can clarify intent and context.
To do this, let’s start with bigram tokenization of the text associated with some of the issue key words. Using the {RWeka} package we can create tables of bi- and tri-grams, which can then be searched using standard regular expressions.
bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE)]
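To make the lookup concrete without requiring {RWeka}, here is a base-R sketch that builds a bigram table by pairing each token with its successor. This swaps in a simple hand-rolled tokenizer in place of the {RWeka} one, the sentence is invented, and `find_bigrams` is a hypothetical helper name.

```r
# Toy sentence (invented for illustration, not real debate text).
text <- "Wall Street banks pay less tax while the middle class pay more tax on wall street"

tokens <- strsplit(tolower(text), " +")[[1]]

# Pair each token with its successor to form bigrams, then count them.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigram_table <- table(bigrams)

# Same lookup pattern as above: keep the bigrams whose label matches the key word.
find_bigrams <- function(word, bigram_table) {
  bigram_table[grep(word, names(bigram_table), ignore.case = TRUE)]
}

wall_bigrams <- find_bigrams("wall", bigram_table)
print(wall_bigrams)
```

Because `grep` matches substrings, a key word like “tax” would also match “taxes” or “taxpayer”; wrapping the pattern in `\\b` word boundaries is one option if exact matches are wanted.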
The word “tax” is heavily used by all the Republican candidates, and the context is almost identical in all cases. Marco Rubio is the most prolific user of the word. Many of the words he uses with “tax”, for example “tax something”, “tax someone”, “tax everyone”, “tax money”, etc., paint a definitely more aggressive image than do the words associated with “tax” by Carly Fiorina, who focuses mostly on policy terms.
Donald Trump’s choice of words pairing with “tax” is similarly focused mostly on policy terms.
The word “wall” is used frequently by both Bernie Sanders and Donald Trump. We can clarify the context by looking at bigrams. In this case it’s clear Bernie Sanders is referring exclusively to “wall street” while Donald Trump mostly refers to his proposal to build border walls.
Donald Trump uses the word “theyre” significantly more than other candidates. The context, as revealed by bigrams, sounds like the script of a zombie movie: “theyre going”, “theyre south”, “theyre feeding”, and “theyre coming”. The language hints at a sentiment that, whoever “they” are, they’re a threat.
NOTE: after this work was completed, the New York Times published a story on the linguistic style of Donald Trump with similar conclusions. Their study included both the words “we” and “they” (“we” is suppressed here as a stop word) and covered a much larger amount of text.
Word clouds provide insight into differences in style but do not delineate well between candidate positions. Surprisingly, opposing candidates can have very similar word clouds.
Looking at the “most frequent” words provides limited insight into differences between candidate positions, and many frequently used words provide no information of interest.
By looking at the coefficient of variation and selecting for key words, we can highlight differences between candidate usages which are of greater interest.
Bigrams provide key differences in context and begin to hint at sentiment.
My next step is to expand the text volume by adding more debate text. Since the data suggest candidate speech is largely consistent from debate to debate, it might also be beneficial to include speech transcripts if these can be found easily online.
Another avenue is to use pre-defined word vectors to coax similarities from the texts. This might help narrow the