Project #2 Overview

The goal of this project is to practice simple text mining methods using former President Donald Trump’s speeches as the text data.

Here is a link to the speeches made by Donald Trump during his presidency.

Libraries

Here are the packages needed for this project. Some are used throughout, while others are only needed for specific tasks later in the assignment.

library(tidyverse)
library(tm)     # text mining
library(lsa)    # to perform Latent Semantic Analysis
library(wordcloud) # to construct wordclouds
library(caret)   # to make Confusion Matrix
library(gains)   # to create Lift Chart and ROC curve

Data Entry

Before beginning, we need to copy and paste each of Trump’s speeches from millercenter.org into .txt files.

When creating the .txt files, I excluded text that was not spoken by Trump himself.

More specifically, I excluded:

As a result, 40 .txt files were created.
The 40 .txt files were then separated into two folders, each containing 20 files.
One folder holds all speeches from 2017, 2018, and 2019.
The other folder holds all speeches from 2020 and 2021, which were busy years for speeches due in large part to the global coronavirus pandemic, among other events.

Input Speech Files

So let’s read in the .txt speech files from both folders.

# Path to Folder 1 (2017-2019)
folder1 = "/Users/Simbo/Desktop/School/STAT415/Homework/Trump_Speeches/2017_2018_2019/"
# Getting names of all files in folder 1
folder1.files = list.files(folder1)
folder1.files = paste0(folder1, folder1.files)

# For each file, read in all non-empty lines,
# skipping the first three, which contain the title and date (not words spoken by Trump)
folder1.speeches <- folder1.files %>% 
                      lapply(FUN=read_lines, skip = 3, skip_empty_rows = TRUE) %>% 
                      lapply(FUN=function(f) paste(f, collapse = " ")) %>%  # join lines with a space so words don't run together
                      unlist()

# Repeat with speeches in the second folder (2020-2021)
folder2 = "/Users/Simbo/Desktop/School/STAT415/Homework/Trump_Speeches/2020_2021/"
folder2.files = list.files(folder2)
folder2.files = paste0(folder2, folder2.files)

folder2.speeches <- folder2.files %>% 
                      lapply(FUN=read_lines, skip = 3, skip_empty_rows = TRUE) %>% 
                      lapply(FUN=function(f) paste(f, collapse = " ")) %>% 
                      unlist()

# Combine all years into one
speeches = c(folder1.speeches, folder2.speeches)

# Create labels: 0 = from folder 1 (2017-19), 1 = from folder 2 (2020-21)
label = rep(c(0, 1), c(length(folder1.speeches), length(folder2.speeches)))
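
As a quick sanity check (a small sketch using base R), we can confirm that all 40 speeches were read in and that the labels line up with the two folders:

# Should report 40 speeches in total
length(speeches)
# Should report 20 speeches labeled 0 (2017-19) and 20 labeled 1 (2020-21)
table(label)
# A very short entry here would suggest a file that did not read in properly
summary(nchar(speeches))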

Create a Corpus

In text mining, a corpus is an (often large) collection of text documents.
The Corpus() function from the tm library creates a corpus from the documents in our folders.
Our corpus will be named corp and used throughout this project.

corp <- VectorSource(speeches) %>% Corpus()
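
To confirm the corpus was built correctly, we can count its documents and peek at the start of one of them (a quick sketch; as.character() is assumed here to extract the text of a corpus element):

# Number of documents in the corpus (should be 40)
length(corp)
# First 200 characters of the first speech
substr(as.character(corp[[1]]), 1, 200)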

tdm <- TermDocumentMatrix(corp)
inspect(tdm)
## <<TermDocumentMatrix (terms: 16103, documents: 40)>>
## Non-/sparse entries: 43975/600145
## Sparsity           : 93%
## Maximal term length: 29
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms   12  16  19  20  24  25   3  32   5   8
##   and  173 201 187 217 206 270 313 373 164 165
##   are   33  32  48  34  64  33  66  65  30  42
##   for   54  71  33  48  62  51  87  84  43  57
##   have  44  25  40  76  49 102 137 117  50  43
##   our   98  60  96  64  91  27 100  77  62  93
##   that  57  63  51  93  65 119 126 131  66  54
##   the  219 279 218 194 273 326 295 434 183 258
##   they  22  10  24  50  12 140 148 209  69  15
##   will  53  27  34  12  34   9  29  69  25  29
##   you    9   2  14  50  49 175 170 197  70  11

Tokenization

Tokenization is the process of taking text and separating it into “tokens” or terms.
In this case, we will strip extra white space and remove punctuation and numbers using the tm_map() function.

corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
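
As a quick check (reusing the as.character() approach from earlier), the same snippet of the first speech should now appear with its punctuation and numbers stripped out:

# First 200 characters of the first speech after cleaning
substr(as.character(corp[[1]]), 1, 200)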

Additionally, most text mining software comes with a list of words that can be removed without losing much information. These words are called stopwords.
The tm library has a list of 174 English stopwords, mostly consisting of pronouns, helping verbs, and prepositions.
These words add noise and have very high frequencies in our documents simply because of English sentence structure, so we will remove them from our corpus using the same tm_map() function.

corp <- tm_map(corp, removeWords, stopwords("english"))
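
For reference, here is a quick look at the stopword list that was just removed (a small sketch):

# Number of English stopwords shipped with tm (should match the 174 mentioned above)
length(stopwords("english"))
# The first few entries are common pronouns and helping verbs
head(stopwords("english"), 15)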

Another technique for narrowing down the number of terms is to look at how frequently terms appear across the documents.
For example, we might drop terms that occur in nearly every document, or terms that are extremely rare.
This leads to the idea of Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF gives high weight to terms that occur frequently within a document but rarely across the rest of the collection, so it highlights documents with many occurrences of otherwise rare terms.
Once again, different text mining software computes TF-IDF in slightly different ways. In the tm package, the function is weightTfIdf(). But before we can use this function, the corpus needs to be turned into a Term Document Matrix.
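
As a toy illustration of the idea before applying it to the speeches (a minimal sketch with three made-up documents; the exact values depend on tm’s normalized TF-IDF formula):

# Three tiny documents: "jobs" appears in every document, "tariffs" only in the third
toy <- c("jobs jobs economy", "jobs economy growth", "tariffs tariffs tariffs jobs")
toy.tdm <- TermDocumentMatrix(Corpus(VectorSource(toy)))

# "tariffs" gets a large weight in document 3, while "jobs" gets weight 0 everywhere
# because it appears in all documents (its inverse document frequency is log2(3/3) = 0)
as.matrix(weightTfIdf(toy.tdm))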

tdm <- TermDocumentMatrix(corp)
dim(tdm)
## [1] 11199    40
inspect(tdm)
## <<TermDocumentMatrix (terms: 11199, documents: 40)>>
## Non-/sparse entries: 34784/413176
## Sparsity           : 92%
## Maximal term length: 24
## Weighting          : term frequency (tf)
## Sample             :
##           Docs
## Terms      12 16 19 20 24  25   3  32  5  8
##   —         0  0  3 13  2 220 125   0  6  1
##   ’re       0  0  0 55  1  43  73 102 42  1
##   ’s        0  2  0 38  6  97  63 109 30  0
##   american 27 12 30 14 35   0   6  20 13 33
##   and       4  6 12 67 18 115 122 103 54  6
##   going     3  1  4 42  5  58  70  65 28  3
##   great    13  8 13 22  9  48  44  62 23  6
##   people   11 42 22 34 19  67 101  93 27  8
##   want      9  5  9 25  6  32  74  60 26  1
##   will     54 29 34 14 34   9  29  69 25 29

Now that we have our reduced corpus as a Term Document Matrix, we can look at the Term Frequency-Inverse Document Frequency (TF-IDF) weights.
The highest-weighted terms are those that are generally rare across the speeches but used very frequently in a few of them (identified by the Doc number).

tfidf <- weightTfIdf(tdm)
inspect(tfidf)
## <<TermDocumentMatrix (terms: 11199, documents: 40)>>
## Non-/sparse entries: 34704/413256
## Sparsity           : 92%
## Maximal term length: 24
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##          Docs
## Terms               10          12          13           19           24
##   –       0.0000000000 0.035581609 0.033814487 0.0000000000 0.0000000000
##   —       0.0016823971 0.000000000 0.000000000 0.0005521171 0.0002894112
##   ’re     0.0002648485 0.000000000 0.000000000 0.0000000000 0.0001594602
##   ’s      0.0000000000 0.000000000 0.000000000 0.0000000000 0.0009567614
##   barrett 0.0000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
##   don’t   0.0000000000 0.000000000 0.000000000 0.0000000000 0.0008289827
##   iran    0.0034796758 0.000000000 0.000000000 0.0019983868 0.0000000000
##   lot     0.0000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
##   nations 0.0078307748 0.001568499 0.003875564 0.0002855387 0.0006735372
##   virus   0.0000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
##          Docs
## Terms              26          30           31           36            8
##   –       0.000000000 0.000000000 0.0014123650 0.0000000000 0.0000000000
##   —       0.000000000 0.000000000 0.0028170064 0.0000000000 0.0001784853
##   ’re     0.000000000 0.000000000 0.0000000000 0.0016604410 0.0001966842
##   ’s      0.000000000 0.000000000 0.0010347456 0.0008302205 0.0000000000
##   barrett 0.000000000 0.000000000 0.0000000000 0.1090878380 0.0000000000
##   don’t   0.000000000 0.000000000 0.0013448277 0.0000000000 0.0002556246
##   iran    0.000000000 0.000000000 0.0000000000 0.0000000000 0.0006460272
##   lot     0.000000000 0.000000000 0.0000000000 0.0000000000 0.0000000000
##   nations 0.001119728 0.000000000 0.0004856242 0.0000000000 0.0008307660
##   virus   0.031346364 0.004262006 0.0022658108 0.0000000000 0.0000000000
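
One convenient way to read this matrix is to pull out the highest-weighted terms for a single speech (a small sketch; document 36 is chosen only because it shows up in the sample above):

# Top 5 TF-IDF terms for document 36
tfidf.m <- as.matrix(tfidf)
sort(tfidf.m[, 36], decreasing = TRUE)[1:5]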

Now let’s create a barplot and a wordcloud of overall term frequencies to see which words dominate the speeches; a variant weighted by TF-IDF, which better highlights words that are rare overall but frequent in just a few speeches, is sketched after the wordcloud.

# Total number of occurrences of each term across all 40 speeches
v = rowSums(as.matrix(tdm))
term = names(v)
occurrences = as.numeric(v)

# Barplot of the 25 most frequent terms (plotting all ~11,000 terms would be unreadable)
data.frame(term, occurrences) %>%
  slice_max(occurrences, n = 25) %>%
  ggplot(mapping = aes(x = reorder(term, occurrences), y = occurrences)) + 
    geom_col() + 
    coord_flip()

wordcloud(term, occurrences, max.words = 100, random.color = TRUE, colors = c("red", "blue"))
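
As mentioned above, a variant of the wordcloud weighted by the TF-IDF scores rather than raw counts better highlights terms that are rare overall but heavily used in a few speeches; a minimal sketch:

# Total TF-IDF weight of each term across all speeches
tfidf.weight <- rowSums(as.matrix(tfidf))
# Wordcloud of the 100 highest-weighted terms
# (min.freq is lowered to 0 because TF-IDF weights are all well below wordcloud's default cutoff of 3)
wordcloud(names(tfidf.weight), tfidf.weight, min.freq = 0, max.words = 100,
          random.color = TRUE, colors = c("red", "blue"))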

Now, using the lsa package, we can perform Latent Semantic Analysis (LSA), which is a dimension reduction method for text data.
LSA does not extract meaning on its own; rather, it summarizes the documents along a small number of latent dimensions (“concepts”) that we can then try to interpret.

# Reduce the TF-IDF matrix to 3 latent dimensions ("concepts")
lsa.tfidf <- lsa(tfidf, dims = 3)

# dk holds the coordinates of each document (speech) in the 3-dimensional concept space
words.df <- as.data.frame(as.matrix(lsa.tfidf$dk))
words.df
##             V1          V2            V3
## 1  -0.05102351  0.05482171 -0.0806071065
## 2  -0.10189353  0.12063965 -0.1118138483
## 3  -0.04302644  0.03124660 -0.0440969958
## 4  -0.04322897  0.03330719 -0.0473548506
## 5  -0.03735111  0.02817525 -0.0393914866
## 6  -0.04184048  0.04661754 -0.0571587306
## 7  -0.05547233  0.04058611 -0.0556360103
## 8  -0.04139433  0.03668848 -0.0450326236
## 9  -0.04506322  0.02808556 -0.0500809428
## 10 -0.04483540  0.04843958 -0.0497179615
## 11 -0.04285311  0.03171454 -0.0514196196
## 12 -0.05847241  0.04372035 -0.0735217321
## 13 -0.13211441  0.18342112 -0.5992217734
## 14 -0.03892498  0.02978269 -0.0407763558
## 15 -0.04441197  0.03483385 -0.0475124489
## 16 -0.04749302  0.04586913 -0.0670090476
## 17 -0.04313464  0.05108982 -0.0460703211
## 18 -0.04097318  0.04290354 -0.0468080721
## 19 -0.04312258  0.03827043 -0.0465637678
## 20 -0.04130346  0.02905263 -0.0409199341
## 21 -0.11671794  0.17785525 -0.4967186587
## 22 -0.07266571  0.09523233 -0.2196612726
## 23 -0.05728816  0.04297404 -0.0668835917
## 24 -0.05259556  0.02958615 -0.0494645199
## 25 -0.04288223  0.02708437 -0.0442598923
## 26 -0.05452260  0.05905837 -0.0685567827
## 27 -0.04887368  0.04498025 -0.0567552522
## 28 -0.05082735  0.02791702 -0.0424874136
## 29 -0.04944533  0.03530515 -0.0537599079
## 30 -0.09630091  0.12639236 -0.0959279431
## 31 -0.05415104  0.03036965 -0.0497909872
## 32 -0.04438702  0.03209880 -0.0445522414
## 33 -0.04536829  0.03406534 -0.0432989122
## 34 -0.04592037  0.03806380 -0.0441157329
## 35 -0.03917059  0.02919905 -0.0460048273
## 36 -0.87123622 -0.45941903  0.1199470573
## 37 -0.06722492  0.05813362 -0.0477721686
## 38 -0.31071756  0.77009310  0.4728881367
## 39 -0.12745066  0.20285322  0.0003989346
## 40 -0.04838010  0.04486497 -0.0428469821
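
Since each speech now has coordinates in a 3-dimensional concept space, it can be informative to plot the first two dimensions and color the points by the year group defined earlier (a sketch using ggplot2; the axes have no inherent meaning beyond “concept 1” and “concept 2”):

# Plot each speech in the first two LSA dimensions, colored by year group
words.df %>%
  mutate(years = ifelse(label == 0, "2017-2019", "2020-2021")) %>%
  ggplot(mapping = aes(x = V1, y = V2, color = years)) +
    geom_point() +
    labs(x = "Concept 1", y = "Concept 2")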

Predictive Modeling and Confusion Matrix

Now let’s produce a confusion matrix. First, we split the data into two parts: a training set and a validation set.

n = nrow(words.df)

# Create a training set with a random 60% of the documents
# (no seed is set, so the exact split varies from run to run)
idx = sample(1:n)
training = idx[1:round(n*0.6)]

# Add labels
trainingData = cbind(label = label[training], words.df[training, ])
# Fit a generalized linear model
reg = glm(label ~ ., data = trainingData, family = "binomial")
# validation data
validData = cbind(label = label[-training], words.df[-training, ])
# Predict with validation data
pred = predict(reg, newdata = validData, type = "response") # Fitted probabilities

# Produce the confusion matrix
pred.label = factor((as.numeric(pred) > 0.5) * 1, levels = c(1,0))
act = label[-training] %>% factor(levels = c(1,0))

confusionMatrix(pred.label, act)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 0
##          1 5 4
##          0 1 6
##                                           
##                Accuracy : 0.6875          
##                  95% CI : (0.4134, 0.8898)
##     No Information Rate : 0.625           
##     P-Value [Acc > NIR] : 0.4067          
##                                           
##                   Kappa : 0.3939          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8333          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.5556          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3750          
##          Detection Rate : 0.3125          
##    Detection Prevalence : 0.5625          
##       Balanced Accuracy : 0.7167          
##                                           
##        'Positive' Class : 1               
##
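
The gains library loaded at the start can also be used to judge the model with a lift chart (a minimal sketch, assuming the gains() interface of that package; with only 16 validation speeches the chart will be quite coarse):

# Lift chart: cumulative number of 2020-21 speeches found vs. number of cases examined
actual <- label[-training]
gain <- gains(actual, pred, groups = 4)

plot(c(0, gain$cume.obs), c(0, gain$cume.pct.of.total * sum(actual)),
     type = "l", xlab = "# cases examined", ylab = "Cumulative # of 2020-21 speeches")
# Baseline: what random ordering would achieve on average
lines(c(0, length(actual)), c(0, sum(actual)), lty = 2)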