The goal of this project is to practice simple text mining methods using former President Donald Trump’s speeches as the text data.
Here is a link to the speeches made by Donald Trump during his presidency.
Here are the packages used in this project. Some are needed throughout, while others are only used for specific tasks later in the assignment.
library(tidyverse)
library(tm) # text mining
library(lsa) # to perform Latent Semantic Analysis
library(wordcloud) # to construct wordclouds
library(caret) # to create a confusion matrix
library(gains) # to create lift charts and ROC curves
Before beginning, we need to copy and paste each of Trump’s speeches from millercenter.org into .txt files.
When creating the .txt files, I excluded certain content from each speech’s page.
As a result, 40 .txt files were created.
The 40 .txt files were then separated into two folders, both containing 20 files.
One folder holds all speeches from 2017, 2018, and 2019.
The other folder holds all speeches from 2020 and 2021, which were busy years for speeches due in part to the global coronavirus pandemic.
So let’s get the .txt speech files from both folders.
# Path to Folder 1 (2017-2019)
folder1 = "/Users/Simbo/Desktop/School/STAT415/Homework/Trump_Speeches/2017_2018_2019/"
# Getting names of all files in folder 1
folder1.files = list.files(folder1)
folder1.files = paste0(folder1, folder1.files)
# For each file, read in all non-empty lines,
# skipping the first three, which contain the title and date (not words spoken by Trump)
folder1.speeches <- folder1.files %>%
  lapply(FUN = read_lines, skip = 3, skip_empty_rows = TRUE) %>%
  lapply(FUN = function(f) paste(f, collapse = "")) %>%
  unlist()
# Repeat with speeches in the second folder (2020-2021)
folder2 = "/Users/Simbo/Desktop/School/STAT415/Homework/Trump_Speeches/2020_2021/"
folder2.files = list.files(folder2)
folder2.files = paste0(folder2, folder2.files)
folder2.speeches <- folder2.files %>%
  lapply(FUN = read_lines, skip = 3, skip_empty_rows = TRUE) %>%
  lapply(FUN = function(f) paste(f, collapse = "")) %>%
  unlist()
# Combine all years into one
speeches = c(folder1.speeches, folder2.speeches)
# Create labels: 0 = from folder 1 (2017-19), 1 = from folder 2 (2020-21)
label = rep(c(0, 1), c(length(folder1.speeches), length(folder2.speeches)))
In text mining, a corpus is an (often) large collection of text.
The Corpus() function from the tm library creates a corpus from our vector of speeches (via VectorSource()).
Our corpus will be named corp and used throughout this project.
corp <- VectorSource(speeches) %>% Corpus()
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
## <<TermDocumentMatrix (terms: 16103, documents: 40)>>
## Non-/sparse entries: 43975/600145
## Sparsity : 93%
## Maximal term length: 29
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 12 16 19 20 24 25 3 32 5 8
## and 173 201 187 217 206 270 313 373 164 165
## are 33 32 48 34 64 33 66 65 30 42
## for 54 71 33 48 62 51 87 84 43 57
## have 44 25 40 76 49 102 137 117 50 43
## our 98 60 96 64 91 27 100 77 62 93
## that 57 63 51 93 65 119 126 131 66 54
## the 219 279 218 194 273 326 295 434 183 258
## they 22 10 24 50 12 140 148 209 69 15
## will 53 27 34 12 34 9 29 69 25 29
## you 9 2 14 50 49 175 170 197 70 11
Tokenization is the process of taking text and separating it into “tokens,” or terms.
Before tokenizing, we will strip extra white space and remove punctuation and numbers using the tm_map() function.
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
Additionally, many text mining packages have lists of words that can be removed without losing much information. These words are called stopwords.
The tm library has a list of 174 English stopwords, mostly consisting of pronouns, helping verbs, and prepositions.
These words can add noise and have very high frequencies in our documents due to English sentence structure. We will remove them from our corpus using the same tm_map() function.
corp <- tm_map(corp, removeWords, stopwords("english"))
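As a quick check of the count mentioned above, we can look at the stopword list directly (a small sketch, not part of the original analysis):
# Number of English stopwords shipped with tm (174, as noted above)
length(stopwords("english"))
# A peek at a few of them
head(stopwords("english"), 10)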
Another technique for narrowing down the number of terms is to look at the frequency of terms in the documents.
For example, we can drop terms that occur frequently across all documents, as well as terms that are extremely rare.
This leads to the idea of Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF highlights documents with many occurrences of otherwise rare terms, as illustrated in the toy example below.
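To make the idea concrete, here is a toy illustration of the TF-IDF weighting on a small hand-made term-frequency matrix. This is my own sketch with made-up counts; the weightTfIdf() function used below also normalizes by document length, so the numbers here are only illustrative.
# Toy term-frequency matrix: 2 terms x 3 documents (made-up counts)
toy <- matrix(c(5, 0, 0,
                2, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(Terms = c("barrett", "people"),
                              Docs = c("d1", "d2", "d3")))
n.docs <- ncol(toy)            # number of documents
doc.freq <- rowSums(toy > 0)   # number of documents containing each term
# Weight each count by log2(number of documents / document frequency)
toy * log2(n.docs / doc.freq)
# "barrett" appears in only one document, so it keeps a large weight there;
# "people" appears in every document, so its weight is driven to zero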
Once again, different text mining packages compute TF-IDF in slightly different ways. In the tm package, the function is weightTfIdf(). But before we can use this function, the corpus needs to be turned into a term-document matrix.
tdm <- TermDocumentMatrix(corp)
dim(tdm)
## [1] 11199 40
inspect(tdm)
## <<TermDocumentMatrix (terms: 11199, documents: 40)>>
## Non-/sparse entries: 34784/413176
## Sparsity : 92%
## Maximal term length: 24
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 12 16 19 20 24 25 3 32 5 8
## — 0 0 3 13 2 220 125 0 6 1
## ’re 0 0 0 55 1 43 73 102 42 1
## ’s 0 2 0 38 6 97 63 109 30 0
## american 27 12 30 14 35 0 6 20 13 33
## and 4 6 12 67 18 115 122 103 54 6
## going 3 1 4 42 5 58 70 65 28 3
## great 13 8 13 22 9 48 44 62 23 6
## people 11 42 22 34 19 67 101 93 27 8
## want 9 5 9 25 6 32 74 60 26 1
## will 54 29 34 14 34 9 29 69 25 29
Now that we have our reduced corpus as a term-document matrix, we can look at the Term Frequency-Inverse Document Frequency (TF-IDF) weights.
The terms that stand out are those that are generally rare, but used very frequently in a few speeches (denoted by the Doc numbers).
tfidf <- weightTfIdf(tdm)
inspect(tfidf)
## <<TermDocumentMatrix (terms: 11199, documents: 40)>>
## Non-/sparse entries: 34704/413256
## Sparsity : 92%
## Maximal term length: 24
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample :
## Docs
## Terms 10 12 13 19 24
## – 0.0000000000 0.035581609 0.033814487 0.0000000000 0.0000000000
## — 0.0016823971 0.000000000 0.000000000 0.0005521171 0.0002894112
## ’re 0.0002648485 0.000000000 0.000000000 0.0000000000 0.0001594602
## ’s 0.0000000000 0.000000000 0.000000000 0.0000000000 0.0009567614
## barrett 0.0000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
## don’t 0.0000000000 0.000000000 0.000000000 0.0000000000 0.0008289827
## iran 0.0034796758 0.000000000 0.000000000 0.0019983868 0.0000000000
## lot 0.0000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
## nations 0.0078307748 0.001568499 0.003875564 0.0002855387 0.0006735372
## virus 0.0000000000 0.000000000 0.000000000 0.0000000000 0.0000000000
## Docs
## Terms 26 30 31 36 8
## – 0.000000000 0.000000000 0.0014123650 0.0000000000 0.0000000000
## — 0.000000000 0.000000000 0.0028170064 0.0000000000 0.0001784853
## ’re 0.000000000 0.000000000 0.0000000000 0.0016604410 0.0001966842
## ’s 0.000000000 0.000000000 0.0010347456 0.0008302205 0.0000000000
## barrett 0.000000000 0.000000000 0.0000000000 0.1090878380 0.0000000000
## don’t 0.000000000 0.000000000 0.0013448277 0.0000000000 0.0002556246
## iran 0.000000000 0.000000000 0.0000000000 0.0000000000 0.0006460272
## lot 0.000000000 0.000000000 0.0000000000 0.0000000000 0.0000000000
## nations 0.001119728 0.000000000 0.0004856242 0.0000000000 0.0008307660
## virus 0.031346364 0.004262006 0.0022658108 0.0000000000 0.0000000000
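Another way to read this output is to pull out the highest-weighted terms for a single speech. For example, document 36, where “barrett” stands out in the sample above, can be inspected directly (a small sketch, not part of the original analysis):
# Convert the weighted term-document matrix to a regular matrix
m.tfidf <- as.matrix(tfidf)
# The 10 highest-weighted terms for document 36
head(sort(m.tfidf[, "36"], decreasing = TRUE), 10)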
Now let’s create a barplot and a wordcloud of how frequently each term occurs across all of the speeches.
v = rowSums(inspect(tdm))
## <<TermDocumentMatrix (terms: 11199, documents: 40)>>
## Non-/sparse entries: 34784/413176
## Sparsity : 92%
## Maximal term length: 24
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 12 16 19 20 24 25 3 32 5 8
## — 0 0 3 13 2 220 125 0 6 1
## ’re 0 0 0 55 1 43 73 102 42 1
## ’s 0 2 0 38 6 97 63 109 30 0
## american 27 12 30 14 35 0 6 20 13 33
## and 4 6 12 67 18 115 122 103 54 6
## going 3 1 4 42 5 58 70 65 28 3
## great 13 8 13 22 9 48 44 62 23 6
## people 11 42 22 34 19 67 101 93 27 8
## want 9 5 9 25 6 32 74 60 26 1
## will 54 29 34 14 34 9 29 69 25 29
term = names(v)
occurrences = as.numeric(v)
data.frame(term, occurrences) %>%
  ggplot(mapping = aes(x = term, y = occurrences)) +
  geom_col()
wordcloud(term, occurrences, random.color = TRUE, colors=c("red", "blue"))
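With roughly 11,000 distinct terms, the full barplot is hard to read. One variant (my own addition, not the original plot) is to restrict it to the 20 most frequent terms:
# Keep only the 20 most frequent terms and flip the axes so the labels are readable
data.frame(term, occurrences) %>%
  slice_max(occurrences, n = 20) %>%
  ggplot(mapping = aes(x = reorder(term, occurrences), y = occurrences)) +
  geom_col() +
  coord_flip() +
  labs(x = "term")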
Now, using the lsa package, we can perform Latent Semantic Analysis (LSA), which is a dimension reduction method for text data.
LSA cannot extract meaning on its own; rather, it groups and summarizes the text data into a small number of concepts that we can then try to interpret.
# Reduce the TF-IDF matrix to 3 concepts
lsa.tfidf <- lsa(tfidf, dim = 3)
# Document scores on the 3 concepts (one row per speech)
words.df <- as.data.frame(as.matrix(lsa.tfidf$dk))
words.df
## V1 V2 V3
## 1 -0.05102351 0.05482171 -0.0806071065
## 2 -0.10189353 0.12063965 -0.1118138483
## 3 -0.04302644 0.03124660 -0.0440969958
## 4 -0.04322897 0.03330719 -0.0473548506
## 5 -0.03735111 0.02817525 -0.0393914866
## 6 -0.04184048 0.04661754 -0.0571587306
## 7 -0.05547233 0.04058611 -0.0556360103
## 8 -0.04139433 0.03668848 -0.0450326236
## 9 -0.04506322 0.02808556 -0.0500809428
## 10 -0.04483540 0.04843958 -0.0497179615
## 11 -0.04285311 0.03171454 -0.0514196196
## 12 -0.05847241 0.04372035 -0.0735217321
## 13 -0.13211441 0.18342112 -0.5992217734
## 14 -0.03892498 0.02978269 -0.0407763558
## 15 -0.04441197 0.03483385 -0.0475124489
## 16 -0.04749302 0.04586913 -0.0670090476
## 17 -0.04313464 0.05108982 -0.0460703211
## 18 -0.04097318 0.04290354 -0.0468080721
## 19 -0.04312258 0.03827043 -0.0465637678
## 20 -0.04130346 0.02905263 -0.0409199341
## 21 -0.11671794 0.17785525 -0.4967186587
## 22 -0.07266571 0.09523233 -0.2196612726
## 23 -0.05728816 0.04297404 -0.0668835917
## 24 -0.05259556 0.02958615 -0.0494645199
## 25 -0.04288223 0.02708437 -0.0442598923
## 26 -0.05452260 0.05905837 -0.0685567827
## 27 -0.04887368 0.04498025 -0.0567552522
## 28 -0.05082735 0.02791702 -0.0424874136
## 29 -0.04944533 0.03530515 -0.0537599079
## 30 -0.09630091 0.12639236 -0.0959279431
## 31 -0.05415104 0.03036965 -0.0497909872
## 32 -0.04438702 0.03209880 -0.0445522414
## 33 -0.04536829 0.03406534 -0.0432989122
## 34 -0.04592037 0.03806380 -0.0441157329
## 35 -0.03917059 0.02919905 -0.0460048273
## 36 -0.87123622 -0.45941903 0.1199470573
## 37 -0.06722492 0.05813362 -0.0477721686
## 38 -0.31071756 0.77009310 0.4728881367
## 39 -0.12745066 0.20285322 0.0003989346
## 40 -0.04838010 0.04486497 -0.0428469821
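Beyond the document scores printed above, the LSA space returned by lsa() also contains the term loadings (tk) and the singular values (sk), which can help with interpreting what each concept represents. A small sketch, not part of the original analysis:
# Term loadings on the 3 concepts (one row per term)
terms.df <- as.data.frame(as.matrix(lsa.tfidf$tk))
# Terms with the largest absolute loadings on the first concept
# (the sign of a concept is arbitrary)
head(terms.df[order(abs(terms.df[[1]]), decreasing = TRUE), ], 10)
# Singular values of the 3 concepts
lsa.tfidf$sk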
Now let’s fit a classification model and produce a confusion matrix, first splitting the data into two parts: a training set and a validation set.
n = nrow(words.df)
# Create a training set with 60% of the documents
idx = sample(1:n)
training = idx[1:round(n*0.6)]
# Add labels
trainingData = cbind(label = label[training], words.df[training, ])
# Fit a generalized linear model
reg = glm(label ~ ., data = trainingData, family = "binomial")
# validation data
validData = cbind(label = label[-training], words.df[-training, ])
# Predict with validation data
pred = predict(reg, newdata = validData, type = "response") # Fitted probabilities
# Produce the confusion matrix
pred.label = factor((as.numeric(pred) > 0.5) * 1, levels = c(1,0))
act = label[-training] %>% factor(levels = c(1,0))
confusionMatrix(pred.label, act)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 5 4
## 0 1 6
##
## Accuracy : 0.6875
## 95% CI : (0.4134, 0.8898)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.4067
##
## Kappa : 0.3939
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.8333
## Specificity : 0.6000
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.8571
## Prevalence : 0.3750
## Detection Rate : 0.3125
## Detection Prevalence : 0.5625
## Balanced Accuracy : 0.7167
##
## 'Positive' Class : 1
##