Introduction

On September 27th, 2018, Brett Kavanaugh gave his opening statement during his Supreme Court confirmation hearing. Over the past few days I've been wondering whether there are any patterns within Kavanaugh's testimony. Since I recently finished a course on text mining through Coursera, I figured I'd try out my new skills on a different data set.

This analysis covers Kavanaugh's opening statement and includes frequent term analysis, a few word clouds, and sentiment analysis.

Process & Steps

I used the following packages: tm, wordcloud, ggplot2, RWeka, and stringi.
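If you're following along, the packages can be loaded up front so the later chunks run as-is:

library(tm)        # corpus creation and cleaning
library(wordcloud) # word clouds (attaches RColorBrewer for the palettes)
library(ggplot2)   # frequency bar charts
library(RWeka)     # n-gram tokenizers
library(stringi)   # string utilities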

Creation of a Corpus

A corpus is a collection of text documents. The basic process is to load the text from a file, clean it, and convert it into a corpus. I used the {tm} package to create the corpus.

Steps:

  1. Set seed
  2. Set working directory
  3. Read all lines from Kavanaugh’s opening statement.
  4. Build a corpus and clean it. Cleaning involves removing punctuation, converting everything to lower case, removing numbers, etc.
set.seed(2972) # for repeatability

# Set working directory for Rich's machine.
setwd("D:/Data/kavanaugh_testimony")

# Read the opening statement, one line per element.
con <- file("kav_opening.txt", "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Strip anything that looks like an email address before building the corpus.
data <- gsub("\\b[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", "", lines)

corpus <- VCorpus(VectorSource(data))

# Replace unwanted patterns with a space so adjacent words aren't glued together.
# These run before the standard cleanup so the URL and @-handle patterns can
# still match their punctuation.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # @handles
corpus <- tm_map(corpus, toSpace, "[^ a-zA-Z&-]|[&-]{2,}")         # stray symbols
corpus <- tm_map(corpus, toSpace, "[^\x01-\x7F]")                  # non-ASCII characters

# Standard tm cleanup.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

Get the word counts/frequencies

Once we have a corpus created, we can generate the n-grams. n-Grams are essentially word combinations. I focused on bigrams, trigrams, and quadgrams, which are 2-, 3-, and 4-word combinations respectively.
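To see what the tokenizer produces before building any matrices, here is a minimal sketch using a made-up sentence (not a line from the testimony):

NGramTokenizer("I went to high school in Maryland",
               Weka_control(min = 2, max = 2))
# returns the bigrams: "I went", "went to", "to high",
# "high school", "school in", "in Maryland"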

Steps:

  1. Create a function that generates the frequency list for the n-grams (unigrams, bigrams, trigrams, and quadgrams).
  2. Create the n-gram tokenizers using the RWeka package (functions NGramTokenizer and Weka_control).
  3. When creating the n-grams, also remove sparse terms, i.e., terms that occur very infrequently.
  4. Create the term document matrix (tdm). The tdm is a matrix with one row per term (bigram, trigram, etc.) and one column per document, holding the counts.
# Turn a term document matrix into a data frame of terms sorted by frequency.
counts <- function(t) {
  f <- sort(rowSums(as.matrix(t)), decreasing = TRUE)
  return(data.frame(word = names(f), freq = f))
}

# RWeka tokenizers for 2-, 3-, and 4-word combinations.
bi_gram <- function(z) NGramTokenizer(z, Weka_control(min = 2, max = 2))
tri_gram <- function(z) NGramTokenizer(z, Weka_control(min = 3, max = 3))
quad_gram <- function(z) NGramTokenizer(z, Weka_control(min = 4, max = 4))

# Frequency tables for each n-gram size, with very sparse terms removed.
freq_uni <- counts(removeSparseTerms(TermDocumentMatrix(corpus), 0.99))
freq_bi <- counts(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bi_gram)), 0.999))
freq_tri <- counts(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = tri_gram)), 0.999))
freq_quad <- counts(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = quad_gram)), 0.999))

# Keep the term document matrices themselves for later steps (e.g., findFreqTerms).
tdm_unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.99)
unigram_levels <- unique(tdm_unigram$dimnames$Terms)
tdm_bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bi_gram)), 0.99)
tdm_trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = tri_gram)), 0.999)

Display Frequencies

# Bar chart of the 20 most frequent terms in a frequency data frame.
gimme_Plot <- function(data, label, co) {
  ggplot(data[1:20, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Count/Frequency") +
    theme(axis.text.x = element_text(angle = 90, size = 10, hjust = 1)) +
    geom_bar(stat = "identity", fill = I(co))
}

gimme_Plot(freq_uni, "Unigrams", "Red")  

gimme_Plot(freq_bi, "Bigrams", "Purple")

gimme_Plot(freq_tri, "Trigrams", "Blue")

gimme_Plot(freq_quad, "Quadgrams", "Gray")

Top Words

Examining the n-grams gives a sense of the most frequently occurring words and phrases:

Top Word from Unigram

The top word is: the. Its frequency was 227.

Top Words from Bigrams

For bigrams, the top bigram is: i was. Its frequency was 34.

Top Words from Trigrams

The top three trigrams are:

The most frequent trigram is: the summer of. Its frequency was 9.

The second most frequent trigram is: in high school. Its frequency was 8.

The third most frequent trigram is: the supreme court. Its frequency was 7.

Quadgram

The most frequently occurring quadgram is: in the summer of. Its frequency was 5.
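The values above were read straight off the freq_* data frames built earlier; for example:

head(freq_uni, 1)   # most frequent unigram
head(freq_bi, 1)    # most frequent bigram
head(freq_tri, 3)   # top three trigrams
head(freq_quad, 1)  # most frequent quadgram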

Frequent Terms Listing

Here’s another look at frequent terms used during the opening statement…

# Unigrams that appear between 4 and 5 times in the opening statement.
fft <- findFreqTerms(tdm_unigram, 4, 5)
fft
##  [1] "allegedly"     "ashley"        "assault"       "assaulted"    
##  [5] "august"        "basketball"    "because"       "beers"        
##  [9] "being"         "both"          "camp"          "character"    
## [13] "clerk"         "college"       "come"          "confirmation" 
## [17] "confirmed"     "dad"           "decades"       "described"    
## [21] "destroy"       "didnt"         "drive"         "due"          
## [25] "during"        "every"         "evil"          "female"       
## [29] "first"         "fords"         "former"        "friend"       
## [33] "girls"         "got"           "happened"      "his"          
## [37] "ill"           "investigation" "judges"        "kavanaugh"    
## [41] "lawyers"       "lifetime"      "line"          "listed"       
## [45] "long"          "make"          "media"         "most"         
## [49] "names"         "national"      "occurred"      "old"          
## [53] "only"          "other"         "person"        "point"        
## [57] "political"     "present"       "put"           "recall"       
## [61] "record"        "refuted"       "respect"       "say"          
## [65] "senate"        "senator"       "service"       "show"         
## [69] "side"          "sometimes"     "started"       "such"         
## [73] "talking"       "text"          "thank"         "then"         
## [77] "think"         "took"          "training"      "under"        
## [81] "white"         "work"          "wrote"         "year"
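The lowfreq and highfreq bounds of findFreqTerms can be adjusted to zoom in on more or less common terms. For example (an illustrative threshold, not part of the original analysis), this lists only the unigrams that occur at least six times:

findFreqTerms(tdm_unigram, lowfreq = 6)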

Word Clouds

The word cloud below, built from the bigram frequencies, gives another view of the most common two-word phrases in the statement.

wordcloud(words = freq_bi$word, freq = freq_bi$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.25,
          colors = brewer.pal(8, "Set1"))