The speeches are available from http://politico.com. I used the ‘rvest’ package to scrape the text and bring it into R. The other packages needed are ‘qdap’ and ‘SnowballC’, the latter to support stemming. Note that I used the Chrome extension ‘SelectorGadget’ to identify the CSS selectors for the relevant text.
library(rvest)
library(qdap)
library(SnowballC)
If you run into an error loading ‘qdap’, update your Java version and make sure its architecture matches that of R (32-bit or 64-bit).
donHTML <- read_html("http://www.politico.com/story/2016/07/full-transcript-donald-trump-nomination-acceptance-speech-at-rnc-225974")
hillHTML <- read_html("http://www.politico.com/story/2016/07/full-text-hillary-clintons-dnc-speech-226410")
SelectorGadget makes it easy to select the right HTML nodes.
donNode <- html_nodes(donHTML, "style~ p")
hillNode <- html_nodes(hillHTML, "style~ p")
You can explore the text as you wish using html_text(). We will need to put the text into a dataframe, but there are some cleaning tasks that need to be done first.
donText <- html_text(donNode)
donText <- sub("Remarks as prepared for delivery according to a draft obtained by POLITICO Thursday afternoon.", '', donText)
donText <- sub("Story Continued Below", '', donText)
hillText <- html_text(hillNode)
hillText <- sub("Hillary Clinton's speech at the Democratic National Convention, as prepared for delivery:", '', hillText)
hillText <- sub("Story Continued Below", '', hillText)
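One caveat on the sub() calls above: by default the pattern is treated as a regular expression, so a period matches any character. That is harmless here, but a minimal sketch of the difference, using a made-up string rather than the speech text, looks like this. Passing fixed = TRUE makes sub() match the pattern literally.

```r
# Made-up example string, not taken from either speech
x <- "draft obtained Thursday afternoon!"

# Default: the pattern is a regular expression, so "." matches the "!"
regex_result <- sub("afternoon.", "", x)

# fixed = TRUE: the pattern is literal, so "afternoon." is not found
fixed_result <- sub("afternoon.", "", x, fixed = TRUE)

regex_result  # "draft obtained Thursday "
fixed_result  # "draft obtained Thursday afternoon!"
```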
If you end up with strange characters in your text, then change the character encoding using the iconv() function. The code below should do the trick.
donText <- iconv(donText, "latin1", "ASCII", "")
hillText <- iconv(hillText, "latin1", "ASCII", "")
Then collapse the paragraphs of each speech into a single string.
donText <- paste(donText, collapse = " ")
hillText <- paste(hillText, collapse = " ")
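As a quick illustration of what the collapse step does, here is a small sketch on a made-up vector of paragraphs. The collapse argument is meant to be a single separator string, and the result is one character string instead of one element per paragraph.

```r
# Made-up paragraphs standing in for the scraped <p> nodes
paragraphs <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# Join all elements into one string, separated by a single space
speech <- paste(paragraphs, collapse = " ")

length(paragraphs)  # 3
length(speech)      # 1
speech              # "First paragraph. Second paragraph. Third paragraph."
```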
This is where the first ‘qdap’ function comes into play, qprep(). This function is a wrapper for a number of other replacement functions and using it will speed pre-processing, but should be used with caution if more detailed analysis is required. The functions it passes through are as follows:
1. bracketX() - apply bracket removal
2. replace_abbreviation() - changes abbreviations
3. replace_number() - numbers to words e.g. 100 becomes one hundred
4. replace_symbol() - symbols become words e.g. @ becomes ‘at’
This chunk of code does the above and also replaces contractions, removes the top 100 stopwords and strips the text of unwanted characters. Note that we will keep the period and the question marks to assist in sentence creation.
donPrep <- qprep(donText)
hillPrep <- qprep(hillText)
donPrep <- replace_contraction(donPrep)
hillPrep <- replace_contraction(hillPrep)
donRm <- rm_stopwords(donPrep, Top100Words, separate = F)
hillRm <- rm_stopwords(hillPrep, Top100Words, separate = F)
donStrip <- strip(donRm, char.keep = c("?", "."))
hillStrip <- strip(hillRm, char.keep = c("?", "."))
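If you want a feel for what the stopword removal and stripping steps do, here is a rough base R sketch. It is not how ‘qdap’ implements them: the helper names remove_stopwords() and strip_keep() and the tiny stopword list are made up for illustration, and Top100Words is much longer.

```r
# Hypothetical mini stopword list (qdap's Top100Words has 100 entries)
stops <- c("the", "a", "of", "and", "to", "is")

# Drop any word that, lower-cased and without edge punctuation, is a stopword
remove_stopwords <- function(sentence, stopwords) {
  words <- strsplit(sentence, "\\s+")[[1]]
  bare <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", words))
  paste(words[!bare %in% stopwords], collapse = " ")
}

# Keep only letters, spaces, periods and question marks, then lower-case
strip_keep <- function(x) tolower(gsub("[^[:alpha:] .?]", "", x))

cleaned <- strip_keep(
  remove_stopwords("The state of the union is strong, and safe?", stops)
)
cleaned  # "state union strong safe?"
```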
One thing you can, and probably should, do is fill the spaces within multi-word terms such as a person’s name, which keeps them together as a single token for the analysis. The ‘keep’ vector below provides an example of this, and it will be used in the space_fill() function. You could include several others.
It is also now time to put both speeches into one dataframe, consisting of the text for each respective candidate.
keep <- c("United States", "Hillary Clinton", "Donald Trump", "middle class", "Supreme Court")
donFill <- data.frame(space_fill(donStrip, keep))
donFill$candidate <- "Trump"
colnames(donFill)[1] <- "text"
hillFill <- data.frame(space_fill(hillStrip, keep))
hillFill$candidate <- "Clinton"
colnames(hillFill)[1] <- "text"
df1 <- rbind(donFill, hillFill)
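To make the idea behind space filling concrete, here is a minimal base R sketch. The function fill_spaces() is my own made-up stand-in, not the ‘qdap’ implementation, and the "~~" separator is an assumption based on how filled terms appear in the sentSplit() output.

```r
# Replace the space inside each listed term with a filler so the term
# survives later tokenization as one word (illustrative helper only)
fill_spaces <- function(text, terms, sep = "~~") {
  for (term in terms) {
    filled_term <- gsub(" ", sep, term, fixed = TRUE)
    text <- gsub(term, filled_term, text, ignore.case = TRUE)
  }
  text
}

filled <- fill_spaces(
  "the supreme court of the united states",
  c("United States", "Supreme Court")
)
filled  # "the Supreme~~Court of the United~~States"
```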
Critical to any analysis with the ‘qdap’ package is splitting the text into sentences with the sentSplit() function. It also creates the ‘tot’ variable, or ‘turn of talk’ index, which would be important for analyzing the debates.
df2 <- sentSplit(df1, "text")
## Warning in sentSplit(df1, "text"): The following problems were detected:
## non character, missing ending punctuation, indicating incomplete
##
## *Consider running `check_text`
str(df2)
## Classes 'sent_split', 'qdap_df', 'sent_split_text_var:text' and 'data.frame': 660 obs. of 3 variables:
## $ candidate: chr "Trump" "Trump" "Trump" "Trump" ...
## $ tot : chr "1.1" "1.2" "1.3" "1.4" ...
## $ text : chr "friends delegates fellow americans humbly gratefully accept nomination presidency United~~States." "together lead our party back white house lead our country back safety prosperity peace." "country generosity warmth." "also country law order." ...
## - attr(*, "text.var")= chr "text"
## - attr(*, "qdap_df_text.var")= chr "text"
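For intuition about the sentence split and the ‘tot’ index, here is a simplified base R sketch on made-up text. The split_sentences() helper is hypothetical and far cruder than sentSplit(); it only splits on whitespace that follows a period, question mark or exclamation point.

```r
# Made-up two-row data frame: one row per speaker turn
speech <- data.frame(
  candidate = c("A", "B"),
  text = c("we will win. will we? yes.", "stronger together."),
  stringsAsFactors = FALSE
)

split_sentences <- function(df) {
  # Split after end marks, keeping the end mark with its sentence
  pieces <- strsplit(df$text, "(?<=[.?!])\\s+", perl = TRUE)
  n <- lengths(pieces)
  data.frame(
    candidate = rep(df$candidate, n),
    # turn.sentence index, e.g. "1.2" is the second sentence of turn 1
    tot = paste(rep(seq_along(n), n), sequence(n), sep = "."),
    text = unlist(pieces),
    stringsAsFactors = FALSE
  )
}

sentences <- split_sentences(speech)
sentences$tot   # "1.1" "1.2" "1.3" "2.1"
sentences$text  # "we will win." "will we?" "yes." "stronger together."
```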
We’ve come to the point, I think, where stemming would normally be implemented. That is, reducing a word to its root e.g. stems, stemming, stemmer all become stem. However, I’m not necessarily a big fan of it anymore and believe it should be applied judiciously. A number of highly experienced text miners have helped me correct the error of my former auto-stemming ways. Also, ‘qdap’ has some flexibility in comparing stemmed text versus non-stemmed text as we shall soon see.
I’ll start out with the standard word frequency analysis. As is usually the case with ‘qdap’, there are a number of options to accomplish a task. On your own, have a look at the bag_o_words() and word_count() functions. Here I pull the most frequent terms (freq_terms() returns the top 20 by default), first overall and then by candidate, and compare them in plots.
freq <- freq_terms(df2$text)
plot(freq)
donFreq <- df2[df2$candidate == "Trump", ]
donFreq <- freq_terms(donFreq$text)
hillFreq <- df2[df2$candidate == "Clinton", ]
hillFreq <- freq_terms(hillFreq$text)
# par(mfrow=c(1,2))
plot(donFreq)
plot(hillFreq)
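If you want to sanity check a frequency table without ‘qdap’, a bare-bones version can be put together with table(). This is only a sketch on made-up sentences and does none of the extra handling that freq_terms() offers.

```r
# Made-up, already cleaned sentences
sentences <- c("jobs jobs trade", "trade law jobs", "law order")

# Tokenize on whitespace, count each word, and sort by count
words <- unlist(strsplit(sentences, "\\s+"))
counts <- sort(table(words), decreasing = TRUE)

names(counts)[1]   # "jobs"
counts[["jobs"]]   # 3
counts[["order"]]  # 1
```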
No surprise that Trump hits “trade”, “violence”, “immigration” and “law”. Hillary likes to talk about “us” and “me” (real shock there). Nothing about children or families?
You can create a word frequency matrix, which provides the counts for each word by speaker.
wordMat <- wfm(df2$text, df2$candidate)
wordMat[c(1:5, 350:354), ]
## Clinton Trump
## abandon 0 1
## abandoned 1 1
## able 2 2
## abolish 0 1
## abroad 1 2
## crosser 0 1
## crossings 0 1
## crucial 1 0
## crushed 0 1
## crying 0 1
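The structure of a word frequency matrix is easy to reproduce in base R, which may help in reading the output above. This is a sketch on made-up text with hypothetical speakers "A" and "B"; wfm() itself does more than this.

```r
# Made-up sentences with a speaker label for each
speeches <- data.frame(
  speaker = c("A", "B", "B"),
  text = c("law order law", "children families", "law children"),
  stringsAsFactors = FALSE
)

# One row per word, carrying the speaker along
tokens <- strsplit(speeches$text, "\\s+")
long <- data.frame(
  word = unlist(tokens),
  speaker = rep(speeches$speaker, lengths(tokens)),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are words, columns are speakers
word_matrix <- table(long$word, long$speaker)
word_matrix["law", "A"]       # 2
word_matrix["children", "B"]  # 2
```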
Of course we need to include the obligatory word cloud. In this case, I will use stemmed words.
trans_cloud(df2$text, df2$candidate, stem = T, min.freq = 10)
There you have it, children and families now appear. Quite a heavy burden being engaged in what former Assistant Director of the FBI, James Kallstrom, characterized as a criminal foundation AND caring for families and children. Now that is leadership!
But I digress. Another great function is word_associate(), which finds the sentences containing a search string and can build word clouds from those matches. Let’s give “terror” a try.
word_associate(df2$text, df2$candidate, match.string = "terror", wordcloud = T)
## row group unit text
## 1 6 Trump 6 attacks our police terrorism our cities threaten our very life.
## 2 82 Trump 82 plan begin safety home means safe neighborhoods secure borders protection terrorism.
## 3 114 Trump 114 task our new administration liberate our citizens crime terrorism lawlessness threatens communities.
## 4 133 Trump 133 once again france victim brutal islamic terrorism.
## 5 139 Trump 139 only weeks ago orlando florida forty nine wonderful americans savagely murdered islamic terrorist.
## 6 140 Trump 140 terrorist targeted our lgbt community.
## 7 142 Trump 142 protect us terrorism need focus three things.
## 8 145 Trump 145 instead must work our allies share our goal destroying isis stamping islamic terror.
## 9 147 Trump 147 lastly must immediately suspend immigration any nation compromised terrorism until such proven vetting mechanisms put place.
## 10 328 Clinton 328 work americans our allies fight terrorism.
## 11 596 Clinton 596 should working responsible gun owners pass common sense reforms keep guns hands criminals terrorists others us harm.
##
## Match Terms
## ===========
##
## List 1:
## terrorism, terrorist, terror, terrorists
##
No commentary needed as “res ipsa loquitur”.
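The matching part of that output can be approximated in base R with grepl() and regmatches(). The sketch below uses made-up sentences and hypothetical speakers, and it skips the word cloud entirely.

```r
# Made-up sentences
sentences <- data.frame(
  speaker = c("A", "A", "B"),
  text = c("fight terrorism together", "create new jobs", "stop terrorists now"),
  stringsAsFactors = FALSE
)

# Rows whose text contains the search string anywhere
hits <- sentences[grepl("terror", sentences$text, fixed = TRUE), ]

# The distinct full words that contain the search string
matched_terms <- unique(unlist(
  regmatches(sentences$text, gregexpr("\\w*terror\\w*", sentences$text))
))

nrow(hits)     # 2
matched_terms  # "terrorism" "terrorists"
```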
Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses some of its visual appeal with just two speakers, but it should stimulate your interest nonetheless. A complete explanation of the stats is available under ?word_stats.
ws <- word_stats(df2$text, df2$candidate, rm.incomplete = T)
## Warning in end_inc(dataframe = DF, text.var = text.var, ...): 17 incomplete sentence items removed
plot(ws, label = T, lab.digits = 2)
## Warning: attributes are not identical across measure variables; they will
## be dropped
The breakdown in the count of sentences and words is interesting. Hillary used roughly a hundred more sentences, but only a couple of hundred more words. I’m curious as to what questions they asked and how they incorporated them.
x1 <- question_type(df2$text, grouping.var = df2$candidate)
x1
## candidate tot.quest where does huh unknown
## 1 Clinton 18 0 1(5.56%) 2(11.11%) 15(83.33%)
## 2 Trump 7 3(42.86%) 1(14.29%) 0 3(42.86%)
truncdf(x1$raw)
## candidate raw.text n.row endmark strip.text q.type
## 1 Trump Our econom 38 ? our econo unknown
## 2 Trump Yet show? 46 ? yet show unknown
## 3 Trump After four 64 ? after fou unknown
## 4 Trump Every acti 131 ? every act does
## 5 Trump Where sanc 161 ? where san where
## 6 Trump Where sanc 162 ? where san where
## 7 Trump Where sanc 163 ? where san where
## 8 Clinton Stay true 313 ? stay true unknown
## 9 Clinton Really? 349 ? really unknown
## 10 Clinton Alone fix? 350 ? alone fix unknown
## 11 Clinton Forgetting 351 ? forgettin unknown
## 12 Clinton Know commu 365 ? know comm unknown
## 13 Clinton Lot looked 369 ? lot looke unknown
## 14 Clinton Big idea? 425 ? big idea unknown
## 15 Clinton Idea real? 427 ? idea real unknown
## 16 Clinton Know? 472 ? know unknown
## 17 Clinton ? 473 ? huh
## 18 Clinton ? 474 ? huh
## 19 Clinton Going done 534 ? going don unknown
## 20 Clinton Going brea 535 ? going bre unknown
## 21 Clinton Sales pitc 544 ? sales pit unknown
## 22 Clinton Put faith 545 ? put faith unknown
## 23 Clinton Ask yourse 579 ? ask yours does
## 24 Clinton Ask just s 598 ? ask just unknown
## 25 Clinton Offering? 625 ? offering unknown
OK, we’ve learned that rows 473 and 474 should be thrown out. It also looks like we have the classic use of anaphora by Trump, which is the technique of repeating the first word or words of several consecutive sentences. I think Churchill used it quite a bit e.g. “We shall not flag or fail. We shall go on to the end. We shall fight in France, we shall…”
df2[c(161:163), 3]
## [1] "where sanctuary kate steinle?"
## [2] "where sanctuary children mary ann sabine jamiel?"
## [3] "where sanctuary americans brutally murdered suffered horribly?"
df2[c(473:474), 3]
## [1] "?" "?"
df2 <- df2[c(-473,-474), ]
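Dropping rows by hard-coded index works, but the row numbers will shift if the scraped text changes. A slightly more robust alternative, sketched here on a made-up data frame, is to drop any row whose text has no letters or digits at all.

```r
# Made-up sentences, two of which are nothing but an end mark
sentences <- data.frame(
  candidate = c("A", "A", "A", "A"),
  text = c("know?", "?", "?", "going done?"),
  stringsAsFactors = FALSE
)

# TRUE where the text contains no alphanumeric character
empty <- !grepl("[[:alnum:]]", sentences$text)

which(empty)  # 2 3
cleaned <- sentences[!empty, ]
nrow(cleaned)  # 2
```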
This is where it gets fun with ‘qdap’. You can tag the text by parts of speech. Check out ?pos and have a look at the vignette for further explanation https://trinker.github.io/qdap/vignettes/qdap_vignette.html
Be advised that this takes some time, which you can track with a progress bar. Notice Clinton’s use and Trump’s lack of use of interjections.
posbydf <- pos_by(df2$text, grouping.var = df2$candidate)
names(posbydf)
## [1] "text" "POStagged" "POSprop" "POSfreq"
## [5] "POSrnp" "percent" "zero.replace" "pos.by.freq"
## [9] "pos.by.prop" "pos.by.rnp"
plot(posbydf, values = T, digits = 2)
Readability scores (measures of speech complexity) are available. I won’t go into the details as I discuss this in my book and detailed information is in the ‘qdap’ vignette.
automated_readability_index(df2$text, df2$candidate)
## candidate word.count sentence.count character.count Automated_Readability_Index
## 1 Clinton 2636 391 15155 9.020
## 2 Trump 2349 267 14616 12.276
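For the curious, the standard Automated Readability Index formula is 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. Plugging in the counts from the output above reproduces the reported scores, as this short sketch shows. The function name ari() is my own.

```r
# Standard Automated Readability Index formula
ari <- function(characters, words, sentences) {
  4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
}

# Counts copied from the automated_readability_index() output above
clinton <- ari(characters = 15155, words = 2636, sentences = 391)
trump   <- ari(characters = 14616, words = 2349, sentences = 267)

round(c(Clinton = clinton, Trump = trump), 3)  # 9.020 and 12.276
```

Longer words and longer sentences both push the score up, which is why Trump's fewer but longer sentences score higher.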
Diversity stats are a measure of language “richness”, or rather, how expansive a speaker’s vocabulary is. The results indicate similar use of vocabulary, which is certainly not unusual given the assistance of professional speech writers.
diversity(df2$text, df2$candidate)
## candidate wc simpson shannon collision berger_parker brillouin
## 1 Clinton 2636 0.997 6.609 5.842 0.028 6.060
## 2 Trump 2349 0.997 6.613 5.708 0.040 6.032
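To give a sense of what a few of these columns measure, here is a sketch of the textbook definitions of the Shannon, Simpson and Berger-Parker indices on a toy word vector. I am assuming ‘qdap’ follows the same definitions; check ?diversity for the exact formulas it uses.

```r
# Toy vocabulary: "a" appears twice, "b" and "c" once each
words <- c("a", "a", "b", "c")

n <- as.vector(table(words))  # count per distinct word
N <- sum(n)                   # total words
p <- n / N                    # proportion per distinct word

shannon       <- -sum(p * log(p))                      # higher = more diverse
simpson       <- 1 - sum(n * (n - 1)) / (N * (N - 1))  # closer to 1 = more diverse
berger_parker <- max(n) / N                            # share of the most common word

round(c(shannon = shannon, simpson = simpson, berger_parker = berger_parker), 3)
# shannon 1.040, simpson 0.833, berger_parker 0.500
```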
Formality contextualizes the text by comparing formal parts of speech (noun, adjective, preposition and article) versus contextual parts of speech (pronoun, verb, adverb, interjection). A plot for analysis is available. Scores closer to 100 are more formal and those closer to 0 are more contextual.
form <- formality(df2$text, df2$candidate)
form
## candidate word.count formality
## 1 Trump 2363 66.55
## 2 Clinton 2651 60.68
plot(form)
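The formality score is based on the Heylighen and Dewaele F-measure, which as I understand it boils down to 50 × ((formal − contextual) / total + 1). Here is a small sketch with made-up counts that shows how the score behaves at the extremes. The function name formality_score() is hypothetical, and the real function derives the counts from part-of-speech tagging.

```r
# F-measure from counts of formal words, contextual words and total words
formality_score <- function(n_formal, n_contextual, n_total) {
  50 * ((n_formal - n_contextual) / n_total + 1)
}

# Made-up counts for illustration
mixed      <- formality_score(n_formal = 60,  n_contextual = 40,  n_total = 100)
all_formal <- formality_score(n_formal = 100, n_contextual = 0,   n_total = 100)
all_context <- formality_score(n_formal = 0,  n_contextual = 100, n_total = 100)

c(mixed, all_formal, all_context)  # 60 100 0
```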
Polarity measures sentence sentiment. A plot is available. What we see is that, on average, Trump was slightly more negative.
pol <- polarity(df2$text, df2$candidate)
plot(pol)
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
## Warning: `show_guide` has been deprecated. Please use `show.legend`
## instead.
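As a very rough intuition for sentence polarity, think of counting positive and negative words and scaling by sentence length. The sketch below does exactly that with a tiny made-up lexicon. It is a deliberate simplification: polarity() uses a full sentiment dictionary and also accounts for negators and amplifiers around each polarized word, which this ignores.

```r
# Hypothetical mini lexicons, for illustration only
positive <- c("safe", "great", "win")
negative <- c("crime", "violence", "fail")

# (positive hits - negative hits) / sqrt(number of words)
simple_polarity <- function(sentence) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  (sum(words %in% positive) - sum(words %in% negative)) / sqrt(length(words))
}

neg_score <- simple_polarity("crime violence threaten our cities")
pos_score <- simple_polarity("we will win")

neg_score  # -2 / sqrt(5), about -0.894
pos_score  #  1 / sqrt(3), about  0.577
```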
The lexical dispersion plot allows one to see where a word occurs throughout the text. It is an interesting way to see how topics change over the course of a speech. Note that you can also include freq_terms should you so choose.
dispersion_plot(df2$text, c("immigration", "jobs", "trade", "children"), df2$candidate)
Finally, an example of a gradient wordcloud, which produces one wordcloud colored by a binary grouping variable. Let’s do one with words not stemmed and one with stemming included.
gradient_cloud(df2$text, df2$candidate, min.freq = 12, stem = F)
gradient_cloud(df2$text, df2$candidate, min.freq = 15, stem = T)
There you have it. Now go find text data, manipulate text data, analyze text data and make text-mining great again.