Text Analysis in Political Speeches

The objective of this analysis is to demonstrate NLP capabilities that go beyond simple word counting and instead examine the context, content, and use of language. Applications range from academia and research to intelligence, literary analysis, open-source competitive analysis, and copyright work.

The speeches used are available from http://politico.com. I used the ‘rvest’ package to identify and pull the text into R; the only other package needed is ‘qdap’. Note that I used the Chrome extension ‘SelectorGadget’ to find the CSS selectors for the relevant text.

Extracting the Text from the Web

library(rJava)
library(rvest)
## Loading required package: xml2
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:rvest':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
library(SnowballC)

Gather the articles; the CSS selector below comes from SelectorGadget:

donHTML <- read_html("http://www.politico.com/story/2016/07/full-transcript-donald-trump-nomination-acceptance-speech-at-rnc-225974")

hillHTML <- read_html("http://www.politico.com/story/2016/07/full-text-hillary-clintons-dnc-speech-226410")

Use the SelectorGadget selector to parse the HTML efficiently and extract only the chunks of text useful for this analysis:

donNode <- html_nodes(donHTML, "style~ p")
hillNode <- html_nodes(hillHTML, "style~ p")

Tidying the Text

You can explore the text as you wish using html_text(). We will eventually need to put the text into a data frame, but there are some cleaning tasks to take care of first.

donText <- html_text(donNode)
donText <- sub("Remarks as prepared for delivery according to a draft obtained by POLITICO Thursday afternoon.", '', donText)
donText <- sub("Story Continued Below", '', donText)
hillText <- html_text(hillNode)
hillText <- sub("Hillary Clinton's speech at the Democratic National Convention, as prepared for delivery:", '', hillText)
hillText <- sub("Story Continued Below", '', hillText)

If you end up with strange characters in your text, change the character encoding with the iconv() function. The vectors of paragraphs also need to be collapsed into single strings.

donText <- iconv(donText, "latin1", "ASCII", "")
hillText <- iconv(hillText, "latin1", "ASCII", "")
#collapse the paragraph vectors into single strings
donText <- paste(donText, collapse = " ")
hillText <- paste(hillText, collapse = " ")

This is where the first ‘qdap’ function comes into play: qprep(). This function is a wrapper for a number of other cleaning functions, and using it will speed up pre-processing. It passes the text through the following:

1. bracketX() - removes bracketed text
2. replace_abbreviation() - expands abbreviations
3. replace_number() - converts numbers to words, e.g. 100 becomes one hundred
4. replace_symbol() - converts symbols to words, e.g. @ becomes ‘at’
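To see what each of these steps does in isolation, here is a quick illustration on toy strings of my own (not from the speeches; exact output may vary slightly across qdap versions):

bracketX("flag this [not this]")              # drops the bracketed text
replace_abbreviation("Dr. Smith arrived.")    # expands to "Doctor Smith arrived."
replace_number("100 days")                    # becomes "one hundred days"
replace_symbol("reach me @ home")             # becomes "reach me at home"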

This chunk of code does the above and also replaces contractions, removes the top 100 stopwords, and strips the text of unwanted characters. Note that we keep the periods and question marks to assist in sentence creation for the deeper speech analysis.

donPrep <- qprep(donText)
hillPrep <- qprep(hillText)

donPrep <- replace_contraction(donPrep)
hillPrep <- replace_contraction(hillPrep)

donRm <- rm_stopwords(donPrep, Top100Words, separate = F)
hillRm <- rm_stopwords(hillPrep, Top100Words, separate = F)

donStrip <- strip(donRm, char.keep = c("?", "."))
hillStrip <- strip(hillRm, char.keep = c("?", "."))

One of the things I’ll do is fill the spaces in selected multi-word phrases, which keeps them together in the analysis, for example a person’s name. The ‘keep’ vector below provides an example of this and will be used in the space_fill() function. You could include several others.

It is also now time to put both speeches into one dataframe, consisting of the text for each respective candidate.

keep <- c("United States", "Hillary Clinton", "Donald Trump", "middle class", "Supreme Court")
donFill <- data.frame(space_fill(donStrip, keep))
donFill$candidate <- "Trump"
colnames(donFill)[1] <- "text"
hillFill <- data.frame(space_fill(hillStrip, keep))
hillFill$candidate <- "Clinton"
colnames(hillFill)[1] <- "text"
df1 <- rbind(donFill, hillFill)

Critical to any analysis with the ‘qdap’ package is putting the text into sentences with the sentSplit() function. It also creates the ‘tot’ variable, or ‘turn of talk’ index, which would be important for analyzing the debates; analysis of dialogue is very easy with this package.

df2 <- sentSplit(df1, "text")
## Warning in sentSplit(df1, "text"): The following problems were detected:
## non character, missing ending punctuation, indicating incomplete
## 
## *Consider running `check_text`
str(df2)
## Classes 'sent_split', 'qdap_df', 'sent_split_text_var:text' and 'data.frame':    660 obs. of  3 variables:
##  $ candidate: chr  "Trump" "Trump" "Trump" "Trump" ...
##  $ tot      : chr  "1.1" "1.2" "1.3" "1.4" ...
##  $ text     : chr  "friends delegates fellow americans humbly gratefully accept nomination presidency United~~States." "together lead our party back white house lead our country back safety prosperity peace." "country generosity warmth." "also country law order." ...
##  - attr(*, "text.var")= chr "text"
##  - attr(*, "qdap_df_text.var")= chr "text"

We’ve come to the point where stemming could be implemented, that is, reducing a word to its root, e.g. stems, stemming, and stemmed all become stem. ‘qdap’ has some flexibility in comparing stemmed versus non-stemmed text, as we shall soon see.
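As a quick aside, the ‘SnowballC’ package loaded earlier exposes the Porter stemmer directly, so you can see the reduction for yourself (a minimal illustration on made-up inputs):

wordStem(c("stems", "stemming", "stemmed"))  # all three reduce to "stem"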

Preliminary analysis

Word Count Plots

I’ll start out with the standard word frequency analysis using the freq_terms() function. Here I create a data frame of the most frequent terms for each candidate (the ‘top’ argument controls how many are returned) and compare the data in plots.

freq <- freq_terms(df2$text)
plot(freq)

donFreq <- df2[df2$candidate == "Trump", ]
donFreq <- freq_terms(donFreq$text)
hillFreq <- df2[df2$candidate == "Clinton", ]
hillFreq <- freq_terms(hillFreq$text)
# par(mfrow=c(1,2))
plot(donFreq)

plot(hillFreq)

No surprise that Trump hits “trade”, “violence”, “immigration” and “law”. Hillary likes to talk about “us” and “me”, and surprisingly uses “wall” more than Trump does in these acceptance speeches. Nothing about children or families?

Word Frequency Matrix

Create a word frequency matrix, which provides the counts for each word by speaker:

wordMat <- wfm(df2$text, df2$candidate)
wordMat[c(1:5, 350:354), ]
##           Clinton Trump
## abandoned       1     1
## able            2     2
## abroad          1     2
## accept          1     1
## access          1     1
## facts           1     3
## failedand       1     0
## fair            2     3
## fairer          1     0
## faith           3     0

Word Clouds of Stemmed Words

Now let’s generate a word cloud for each candidate, this time using stemmed words:

trans_cloud(df2$text, df2$candidate, stem = T, min.freq = 10, title=TRUE)

There you have it: with stemming, children and families now appear in Clinton’s narrative, which doubles down on topics such as “work”, “believe”, and “family”. Trump’s cloud has fewer topics, with “country”, “america”, and “nation” drawing his topic line.

Word Association

A great function is word_associate(), which can also build word clouds based on the association. Let’s give “wall”, “terror” and “deal” a try. The terms were deliberately chosen to make the example clear.

word_associate(df2$text, df2$candidate, match.string = "wall", wordcloud = T)

##   row   group unit text                                                                                                          
## 1 167   Trump  167 going build great border wall stop illegal immigration stop gangs violence stop drugs pouring our communities.
## 2 325 Clinton  325 build wall.                                                                                                   
## 3 494 Clinton  494 believe wall street never ever allowed wreck main street again.                                               
## 4 529 Clinton  529 heres wall street corporations super rich going start paying fair share taxes.
## 
## Match Terms
## ===========
## 
## List 1:
## wall
## 
word_associate(df2$text, df2$candidate, match.string = "terror", wordcloud = T)

##    row   group unit text                                                                                                                        
## 1    6   Trump    6 attacks our police terrorism our cities threaten our very life.                                                             
## 2   82   Trump   82 plan begin safety home means safe neighborhoods secure borders protection terrorism.                                        
## 3  114   Trump  114 task our new administration liberate our citizens crime terrorism lawlessness threatens communities.                        
## 4  133   Trump  133 once again france victim brutal islamic terrorism.                                                                          
## 5  139   Trump  139 only weeks ago orlando florida forty nine wonderful americans savagely murdered islamic terrorist.                          
## 6  140   Trump  140 terrorist targeted our lgbt community.                                                                                      
## 7  142   Trump  142 protect us terrorism need focus three things.                                                                               
## 8  145   Trump  145 instead must work our allies share our goal destroying isis stamping islamic terror.                                        
## 9  147   Trump  147 lastly must immediately suspend immigration any nation compromised terrorism until such proven vetting mechanisms put place.
## 10 328 Clinton  328 work americans our allies fight terrorism.                                                                                  
## 11 596 Clinton  596 should working responsible gun owners pass common sense reforms keep guns hands criminals terrorists others us harm.
## 
## Match Terms
## ===========
## 
## List 1:
## terrorism, terrorist, terror, terrorists
## 
word_associate(df2$text, df2$candidate, match.string = "deal", wordcloud = T)

##    row   group unit text                                                                                                                                                            
## 1   51   Trump   51 just prior signing iran deal gave back iran dollar hundred fifty billion gave us nothing history worst deals ever.                                              
## 2   93   Trump   93 visited laid off factory workers communities crushed our horrible unfair trade deals.                                                                           
## 3  185   Trump  185 billions dollars business making deals im going our country rich again.                                                                                         
## 4  187   Trump  187 america lost nearly third manufacturing jobs since thousand nine hundred ninety seven following enactment disastrous trade deals supported bill Hillary Clinton.
## 5  188   Trump  188 remember bill clinton signed nafta worst economic deals ever our country.                                                                                       
## 6  193   Trump  193 supported job killing trade deal south korea.                                                                                                                   
## 7  197   Trump  197 instead individual deals individual countries.                                                                                                                  
## 8  198   Trump  198 longer enter massive deals countries thousands pages our country even reads understands.                                                                        
## 9  202   Trump  202 includes renegotiating nafta much better deal america well walk away dont deal want.                                                                            
## 10 209   Trump  209 going deal issue regulation greatest job killers.                                                                                                               
## 11 433 Clinton  433 big deal.                                                                                                                                                       
## 12 434 Clinton  434 should big deal president.                                                                                                                                      
## 13 503 Clinton  503 believe should say unfair trade deals|                                                                                                                          
## 14 528 Clinton  528 know fighting affordable child care paid family leave playing woman card deal me heres thing only going investments going pay every single.                     
## 15 556 Clinton  556 baghdad kabul nice paris brussels san bernardino orlando dealing determined enemies must defeated.
## 
## Match Terms
## ===========
## 
## List 1:
## deal, deals, dealing
## 

No commentary needed; the output is self-explanatory about where the candidates focus and how each changes the connotation of the selected target word.

Word Stats

A complete explanation of the stats is available under ?word_stats.

Here is a quick reference:

n.tot - number of turns of talk
n.sent - number of sentences
n.words - number of words
n.char - number of characters
n.syl - number of syllables
n.poly - number of polysyllables
sptot - syllables per turn of talk
wptot - words per turn of talk
wps - words per sentence
cps - characters per sentence
sps - syllables per sentence
psps - poly-syllables per sentence
cpw - characters per word
spw - syllables per word
n.state - number of statements
n.quest - number of questions
n.exclm - number of exclamations
n.incom - number of incomplete statements
p.state - proportion of statements
p.quest - proportion of questions
p.exclm - proportion of exclamations
p.incom - proportion of incomplete statements
n.hapax - number of hapax legomenon
n.dis - number of dis legomenon
grow.rate - proportion of hapax legomenon to words
prop.dis - proportion of dis legomenon to words

ws <- word_stats(df2$text, df2$candidate, rm.incomplete = T)
## Warning in end_inc(dataframe = DF, text.var = text.var, ...): 17 incomplete sentence items removed
plot(ws, label = T, lab.digits = 2)
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: Ignoring unknown aesthetics: fill

The breakdown in the counts of sentences and words is interesting: Hillary used over a hundred more sentences, but fewer than three hundred more words. I’m curious as to what questions they asked and how they incorporated them. Without the analysis we might have assumed that Trump would use fewer words and fewer polysyllabic words, which is not true (whether those words are relevant is still an open question).

Question Extraction

The next analysis extracts all of the questions asked during each speech (yes, mostly rhetorical) and tells you the type of each question.

x1 <- question_type(df2$text, grouping.var = df2$candidate)
x1
##   candidate tot.quest     where      does       huh    unknown
## 1   Clinton        18         0  1(5.56%) 2(11.11%) 15(83.33%)
## 2     Trump         7 3(42.86%) 1(14.29%)         0  3(42.86%)
truncdf(x1$raw)
##    candidate   raw.text n.row endmark strip.text  q.type
## 1      Trump Our econom    38       ?  our econo unknown
## 2      Trump  Yet show?    46       ?  yet show  unknown
## 3      Trump After four    64       ?  after fou unknown
## 4      Trump Every acti   131       ?  every act    does
## 5      Trump Where sanc   161       ?  where san   where
## 6      Trump Where sanc   162       ?  where san   where
## 7      Trump Where sanc   163       ?  where san   where
## 8    Clinton Stay true    313       ?  stay true unknown
## 9    Clinton    Really?   349       ?    really  unknown
## 10   Clinton Alone fix?   350       ?  alone fix unknown
## 11   Clinton Forgetting   351       ?  forgettin unknown
## 12   Clinton Know commu   365       ?  know comm unknown
## 13   Clinton Lot looked   369       ?  lot looke unknown
## 14   Clinton  Big idea?   425       ?  big idea  unknown
## 15   Clinton Idea real?   427       ?  idea real unknown
## 16   Clinton      Know?   472       ?      know  unknown
## 17   Clinton          ?   473       ?                huh
## 18   Clinton          ?   474       ?                huh
## 19   Clinton Going done   534       ?  going don unknown
## 20   Clinton Going brea   535       ?  going bre unknown
## 21   Clinton Sales pitc   544       ?  sales pit unknown
## 22   Clinton Put faith    545       ?  put faith unknown
## 23   Clinton Ask yourse   579       ?  ask yours    does
## 24   Clinton Ask just s   598       ?  ask just  unknown
## 25   Clinton  Offering?   625       ?  offering  unknown

OK, we’ve learned that rows 473 and 474 should be thrown out, according to the question-type output. It also looks like we have a classic use of anaphora by Trump, which is the technique of repeating the first word or words of several consecutive sentences. I think Churchill used it quite a bit, e.g. “We shall not flag or fail. We shall go on to the end. We shall fight in France, we shall…”

df2[c(161:163), 3] 
## [1] "where sanctuary kate steinle?"                                 
## [2] "where sanctuary children mary ann sabine jamiel?"              
## [3] "where sanctuary americans brutally murdered suffered horribly?"
df2[c(473:474), 3]
## [1] "?" "?"
df2 <- df2[c(-473,-474), ]

Advanced analysis

NLP Wrapper

The pos functions are wrappers around the openNLP part-of-speech tagger. First, a dictionary of the Penn Treebank tags for reference and interpretation:

##    Tag  Description                             
## 1  CC   Coordinating conjunction                
## 2  CD   Cardinal number                         
## 3  DT   Determiner                              
## 4  EX   Existential there                       
## 5  FW   Foreign word                            
## 6  IN   Preposition or subordinating conjunction
## 7  JJ   Adjective                               
## 8  JJR  Adjective, comparative                  
## 9  JJS  Adjective, superlative                  
## 10 LS   List item marker                        
## 11 MD   Modal                                   
## 12 NN   Noun, singular or mass                  
## 13 NNS  Noun, plural                            
## 14 NNP  Proper noun, singular                   
## 15 NNPS Proper noun, plural                     
## 16 PDT  Predeterminer                           
## 17 POS  Possessive ending                       
## 18 PRP  Personal pronoun                        
## 19 PRP$ Possessive pronoun                      
## 20 RB   Adverb                                  
## 21 RBR  Adverb, comparative                     
## 22 RBS  Adverb, superlative                     
## 23 RP   Particle                                
## 24 SYM  Symbol                                  
## 25 TO   to                                      
## 26 UH   Interjection                            
## 27 VB   Verb, base form                         
## 28 VBD  Verb, past tense                        
## 29 VBG  Verb, gerund or present participle      
## 30 VBN  Verb, past participle                   
## 31 VBP  Verb, non-3rd person singular present   
## 32 VBZ  Verb, 3rd person singular present       
## 33 WDT  Wh-determiner                           
## 34 WP   Wh-pronoun                              
## 35 WP$  Possessive wh-pronoun                   
## 36 WRB  Wh-adverb

Be advised that this takes some time, which you can track with a progress bar. Notice Clinton’s use and Trump’s lack of use of interjections.

posbydf <- pos_by(df2$text, grouping.var = df2$candidate)
names(posbydf)
##  [1] "text"         "POStagged"    "POSprop"      "POSfreq"     
##  [5] "POSrnp"       "percent"      "zero.replace" "pos.by.freq" 
##  [9] "pos.by.prop"  "pos.by.rnp"
plot(posbydf, values = T, digits = 2)
## Warning: Ignoring unknown aesthetics: fill

Readability Score

Readability scores were originally designed to measure the difficulty of written text. Scores are generally based on the number of words, syllables, polysyllables, and word length. While these scores are not specifically designed for, or tested on, speech, they can be useful indicators of speech complexity.

automated_readability_index(df2$text, df2$candidate)
##   candidate word.count sentence.count character.count Automated_Readability_Index
## 1   Clinton       2636            391           15155                       9.020
## 2     Trump       2349            267           14616                      12.276
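The Automated Readability Index is a simple linear formula, ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43, so we can reproduce the scores by hand from the counts in the table above:

4.71 * (15155 / 2636) + 0.5 * (2636 / 391) - 21.43  # ~9.02, Clinton
4.71 * (14616 / 2349) + 0.5 * (2349 / 267) - 21.43  # ~12.28, Trump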

Linguistic Diversity Stats

Diversity stats measure language “richness”, or rather how expansive a speaker’s vocabulary is. The results indicate very similar vocabulary use, certainly not unusual given the assistance of professional speech writers.

diversity(df2$text, df2$candidate)
##   candidate   wc simpson shannon collision berger_parker brillouin
## 1   Clinton 2636   0.997   6.609     5.842         0.028     6.060
## 2     Trump 2349   0.997   6.613     5.708         0.040     6.032
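For intuition, the Shannon measure is the entropy of the word distribution, H = -sum(p * log(p)). Here is a rough cross-check using the word frequency matrix built earlier (rough because diversity() may tokenize slightly differently):

p <- wordMat[, "Clinton"] / sum(wordMat[, "Clinton"])
p <- p[p > 0]
-sum(p * log(p))  # should land near the 6.609 reported above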

Formal or Contextual?

Formality contextualizes the text by comparing formal parts of speech (nouns, adjectives, prepositions, and articles) with contextual parts of speech (pronouns, verbs, adverbs, and interjections). A plot is available. Scores closer to 100 are more formal and those closer to 0 are more contextual.
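Under the hood this is Heylighen and Dewaele’s F-measure (see ?formality for the reference). As I understand the implementation, the score is computed from the percentage of words falling in each class:

F = (noun % + adjective % + preposition % + article % - pronoun % - verb % - adverb % - interjection % + 100) / 2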

form <- formality(df2$text, df2$candidate)
form
##   candidate word.count formality
## 1     Trump       2363     66.55
## 2   Clinton       2651     60.68
plot(form)

Polarity Measures AKA Sentiment Analysis

Polarity measures sentence sentiment. A plot is available. What we see is that, on average, Trump was slightly more negative.

pol <- polarity(df2$text, df2$candidate)
plot(pol)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: Ignoring unknown aesthetics: x
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
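If you want the numbers behind the plot, qdap provides generic extractors for polarity objects; a minimal sketch (output omitted):

scores(pol)  # average polarity by candidate
counts(pol)  # sentence-level polarity scores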

Lexical Dispersion

The lexical dispersion plot allows one to see where a word occurs throughout the text, which is an interesting way to watch topics shift over the course of a speech. Note that you can also feed it terms from freq_terms(), as shown after the plot below.

dispersion_plot(df2$text, c("immigration", "jobs", "trade", "children", "wall"),     df2$candidate)
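For example, here is a hypothetical way to pipe the most frequent non-stopword terms straight into the plot (freq_terms() returns a data frame with a WORD column; ‘top5’ is a name I made up):

top5 <- freq_terms(df2$text, top = 5, stopwords = Top100Words)
dispersion_plot(df2$text, top5$WORD, df2$candidate)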

Stemmed vs. Unstemmed

Finally, an example of a gradient word cloud, which produces a single word cloud colored by a binary grouping variable. Let’s do one with unstemmed words and one with stemming included.

gradient_cloud(df2$text, df2$candidate, min.freq = 12, stem = F)

gradient_cloud(df2$text, df2$candidate, min.freq = 15, stem = T)

Conclusion

After the analysis, you may be surprised at how close the candidates are to each other in language, structure, and messaging emphasis. This suggests these speeches did little to help the electorate make a choice, since a clear difference between the proposals was not there, at least at the very beginning of the campaign.